Feature selection is an essential step in data science pipelines, reducing the complexity of models trained on large datasets. While a major part of feature selection research focuses on optimizing predictive performance, only a few studies investigate integrating feature selection stability into the selection process itself. Exploiting feature selection stability has the potential to enhance the interpretability of machine learning models while maintaining predictive performance. In this study, we present the RENT feature selector for binary classification and regression problems. The proposed methodology is based on an ensemble of elastic net regularized models, each trained on a unique subset of the dataset. RENT selects features based on three criteria that evaluate the weight distributions of features across all elementary models. Compared to conventional approaches, RENT performs high-quality feature selection while simultaneously gathering useful information for model interpretation. In addition, the proposed ensemble-based selection criteria guarantee robustness of the model by selecting features with high stability. In an experimental evaluation, we compare feature selection quality on eight multivariate datasets: six for binary classification and two for regression. We benchmark RENT against six established feature selectors. In terms of both the number of features selected and predictive performance, RENT performs on par with the best-performing competitors. The additional stability information provided by RENT can be integrated into an exploratory post-hoc analysis for further insight, as demonstrated in a use case from the healthcare domain.
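The ensemble idea described above can be illustrated with a minimal sketch for the regression case. It is not the RENT implementation: the abstract does not spell out the three selection criteria, so the ones used here (frequency of non-zero weights, sign consistency, and a t-like statistic on the mean weight) are assumptions, as are all function names, the cutoffs `tau1`, `tau2`, `tau3`, the hyperparameter values, and the synthetic data. The elastic net is fitted with a simple coordinate descent written in plain Python so that the example has no dependencies.

```python
import math
import random
import statistics


def soft_threshold(value, threshold):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.2, l1_ratio=0.9, n_iter=50):
    """Elastic net regression via cyclic coordinate descent (no intercept).

    Minimizes (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


def ensemble_weights(X, y, n_models=20, subset_fraction=0.8, seed=0):
    """Fit one elastic net model per random subset (drawn without replacement)."""
    rng = random.Random(seed)
    n = len(X)
    k = int(subset_fraction * n)
    weights = []
    for _ in range(n_models):
        idx = rng.sample(range(n), k)
        weights.append(fit_elastic_net([X[i] for i in idx], [y[i] for i in idx]))
    return weights


def select_features(weights, tau1=0.9, tau2=0.9, tau3=2.0):
    """Keep features whose weights are stable across the ensemble.

    Assumed criteria (illustrative, not taken from the abstract):
      1. fraction of models with a non-zero weight >= tau1
      2. fraction of models sharing the dominant sign >= tau2
      3. |mean weight| / standard error of the mean >= tau3
    """
    n_models, p = len(weights), len(weights[0])
    selected = []
    for j in range(p):
        w_j = [w[j] for w in weights]
        nonzero_freq = sum(v != 0.0 for v in w_j) / n_models
        sign_consistency = max(sum(v > 0 for v in w_j), sum(v < 0 for v in w_j)) / n_models
        mean, sd = statistics.mean(w_j), statistics.stdev(w_j)
        if sd > 0:
            t_stat = abs(mean) / (sd / math.sqrt(n_models))
        else:
            t_stat = math.inf if mean != 0 else 0.0
        if nonzero_freq >= tau1 and sign_consistency >= tau2 and t_stat >= tau3:
            selected.append(j)
    return selected


# Synthetic data (hypothetical): only features 0 and 1 influence the target.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + data_rng.gauss(0, 0.1) for row in X]

weights = ensemble_weights(X, y)
selected = select_features(weights)
print("selected feature indices:", selected)
```

The per-feature weight lists collected in `weights` are also the kind of by-product that a post-hoc stability analysis could inspect. For binary classification, the elementary models would be elastic net regularized classifiers (for example logistic regression) rather than the regression fit sketched here.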