A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification

Biomedical data are widely used to develop prediction models for identifying specific tumors, drug discovery, and classification of human cancers. However, previous studies have usually focused on comparing classifiers and have overlooked the class imbalance problem in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for learning from imbalanced biomedical data. The relationship between data pre-processing techniques and data distributions has never been analysed in previous studies. This article reviews and evaluates some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from a data distribution perspective. Extensive experiments were conducted with five classifiers, four performance measures, and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques perform better with the support vector machine (SVM) classifier, but perform poorly with the C4.5 decision tree and linear discriminant analysis classifiers; (2) for datasets with different distributions, random undersampling and feature selection perform better than other data pre-processing methods on the T Location-Scale distribution when using the SVM and K-nearest neighbours (KNN) classifiers, while random oversampling outperforms other methods on the Negative Binomial distribution when using the Random Forest classifier at lower imbalance ratios; (3) feature selection outperforms other data pre-processing methods in most cases, so feature selection with the SVM classifier is the best choice for imbalanced biomedical data learning.
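The two resampling baselines named in the abstract, random undersampling and random oversampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_undersample` and `random_oversample`, the fixed `seed`, and the toy data are assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement: each retained sample appears once.
        kept.extend(rng.sample(indices, target))
    kept.sort()
    return [samples[i] for i in kept], [labels[i] for i in kept]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples so every class matches the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = max(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Keep every original sample, then add random duplicates as needed.
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        kept.extend(indices + extra)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 8 majority samples (label 0), 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under))
print(Counter(y_over))
```

On this toy set, undersampling leaves two samples per class and oversampling leaves eight per class. Undersampling discards majority-class information, while oversampling repeats minority samples, which is one reason the two can behave differently across classifiers and imbalance ratios.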

Related content

Feature Selection

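Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N of the M available features so that a specific system metric is optimised. It is the process of selecting the most effective features from the original set to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data pre-processing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.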
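As a rough illustration of choosing N of M features, the sketch below ranks each feature with a simple filter score and keeps the top N. The scoring rule (absolute gap between class means divided by the feature's standard deviation), the binary 0/1 labels, the helper names `score_feature` and `select_features`, and the toy data are all assumptions for this example. The text above does not prescribe a particular criterion.

```python
from statistics import mean, pstdev


def score_feature(values, labels):
    """Score one feature by the gap between class means, scaled by its spread."""
    positives = [v for v, label in zip(values, labels) if label == 1]
    negatives = [v for v, label in zip(values, labels) if label == 0]
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # a constant feature carries no class information
    return abs(mean(positives) - mean(negatives)) / spread


def select_features(samples, labels, n_keep):
    """Return the column indices of the n_keep highest-scoring features."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        scores.append((score_feature(column, labels), j))
    # Sort by descending score; break ties by the lower column index.
    top = sorted(scores, key=lambda pair: (-pair[0], pair[1]))[:n_keep]
    return sorted(j for _, j in top)


# Hypothetical toy data with M = 4 features; only columns 0 and 2 differ by class.
X = [
    [1.0, 5.0, 10.0, 3.0],
    [1.2, 4.0, 11.0, 3.0],
    [0.9, 6.0, 10.5, 3.0],
    [3.0, 5.0, 20.0, 3.0],
    [3.1, 4.0, 21.0, 3.0],
    [2.9, 6.0, 19.5, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_features(X, y, n_keep=2)
X_reduced = [[row[j] for j in selected] for row in X]
print(selected)
```

With N = 2, the sketch keeps columns 0 and 2 and drops the two uninformative columns. A filter score like this evaluates each feature independently, so it is cheap but ignores interactions between features.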
