Abstract
In modern high-dimensional data sets, feature selection is an essential pre-processing step for many statistical modelling tasks. Cost-sensitive feature selection extends this concept by introducing so-called feature costs. These need not be financial costs; they can be seen as a general construct for numerically quantifying any undesirable aspect of a feature, such as the run-time of a measurement procedure or the patient harm of a biomarker test. There are several ways to define a cost-sensitive feature selection setup. The strategy applied in this thesis is to introduce an additive cost budget as an upper bound on the total cost. This extends the standard feature selection problem by an additional constraint on the sum of the costs of the included features. A main area of research in this field is the adaptation of standard feature selection algorithms to account for this additional constraint. However, cost-aware selection criteria also play an important role in the overall performance of these methods and need to be discussed in detail as well.
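The budget-constrained setup can be written compactly as an optimization problem. The notation below is an assumption introduced for illustration only: $p$ features with non-negative costs $c_1, \dots, c_p$, a budget $c_{\max}$, and a generic quality measure $Q(S)$ for a feature subset $S$ (for example, the predictive performance of a model fitted on $S$).

```latex
\[
  \max_{S \subseteq \{1, \dots, p\}} \; Q(S)
  \quad \text{subject to} \quad
  \sum_{j \in S} c_j \le c_{\max}
\]
```

Without the constraint, or with a budget at least as large as the sum of all costs, this reduces to the standard feature selection problem.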
This cumulative dissertation summarizes the work of three papers in this field. Two of them introduce new methods for cost-sensitive feature selection with a fixed budget constraint. The third discusses a common trade-off criterion between performance and cost. For this criterion, an analysis of the selection outcome in different setups revealed a reduced ability to distinguish between information and noise. This can be counteracted, for example, by introducing a hyperparameter into the criterion. The presented research on new cost-sensitive methods comprises adaptations of Greedy Forward Selection, Genetic Algorithms and filter approaches, as well as a novel Random Forest based algorithm that selects individual trees from a low-cost tree ensemble. The central concepts of each method are discussed, and thorough simulation studies evaluate their individual strengths and weaknesses. Every simulation study includes artificial as well as real-world data examples to validate the results in a broad context. Finally, all chapters present discussions with practical recommendations on the application of the proposed methods and conclude with an outlook on possible further research on the respective topics.
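As a rough illustration of how a greedy forward selection can respect a fixed budget, the following is a minimal sketch and not the implementation from the underlying papers. It assumes that the trade-off criterion is the score gain divided by the feature cost raised to a hypothetical exponent `lam`; the function name, the `score` callback, the toy data and `lam` itself are all assumptions made for this example, and the criterion and hyperparameter used in the thesis may differ.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_budget_selection(
    costs: Dict[str, float],
    score: Callable[[FrozenSet[str]], float],
    budget: float,
    lam: float = 1.0,
) -> List[str]:
    """Greedy forward selection under an additive cost budget.

    In each step, the affordable feature with the largest value of
    (score gain) / cost**lam is added. `lam` is a hypothetical
    hyperparameter: lam=1 is a plain benefit-cost ratio, lam=0 ignores
    costs in the ranking (the budget is still enforced).
    All costs are assumed to be strictly positive.
    """
    selected: List[str] = []
    spent = 0.0
    current = score(frozenset())
    while True:
        best_name, best_value, best_score = None, 0.0, current
        for name, cost in costs.items():
            if name in selected or spent + cost > budget:
                continue  # already chosen, or it would exceed the budget
            new_score = score(frozenset(selected + [name]))
            value = (new_score - current) / (cost ** lam)
            if value > best_value:
                best_name, best_value, best_score = name, value, new_score
        if best_name is None:
            break  # no affordable feature improves the score
        selected.append(best_name)
        spent += costs[best_name]
        current = best_score
    return selected


# Toy data (invented for illustration): additive gains per feature.
toy_costs = {"a": 1.0, "b": 4.0, "c": 2.0}
toy_gains = {"a": 0.3, "b": 0.6, "c": 0.4}


def toy_score(subset: FrozenSet[str]) -> float:
    """Toy quality measure: sum of the individual gains."""
    return sum(toy_gains[name] for name in subset)


ratio_choice = greedy_budget_selection(toy_costs, toy_score, budget=5.0, lam=1.0)
plain_choice = greedy_budget_selection(toy_costs, toy_score, budget=5.0, lam=0.0)
```

In this toy setting the two values of `lam` lead to different subsets within the same budget, which is the kind of behavior such a hyperparameter is meant to control. In practice `score` would be replaced by a cross-validated performance estimate.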