In this paper, we propose a new wrapper approach for semi-supervised feature selection. A common strategy in semi-supervised learning is to augment the training set with pseudo-labeled unlabeled examples. However, the pseudo-labeling procedure is error-prone and carries a high risk of disrupting the learning algorithm with noisily labeled training data. To overcome this, we propose to explicitly model the mislabeling error during the learning phase, with the overall aim of selecting the most relevant features. We derive a $\mathcal{C}$-bound for Bayes classifiers trained over partially labeled training sets that takes the mislabeling errors into account. The risk bound is then used as an objective function that is minimized over the space of possible feature subsets using a genetic algorithm. To produce a solution that is both sparse and accurate, we propose a modified genetic algorithm with a crossover based on feature weights and recursive elimination of irrelevant features. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised feature selection approaches.
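To make the search procedure more concrete, the following is a minimal sketch of a genetic algorithm over feature subsets with a weight-biased crossover and a shrinking subset size. It is an illustration under stated assumptions, not the method itself. The names `weighted_crossover`, `select_features`, and `toy_evaluate`, the `shrink` factor, and the selection scheme are all hypothetical. The toy objective merely stands in for the risk bound, which is not reproduced here.

```python
import random


def weighted_crossover(parent_a, parent_b, weights, n_keep, rng):
    """Draw up to n_keep features from the union of two parents,
    sampling without replacement and favouring high-weight features.
    Weights are assumed to be non-negative."""
    pool = sorted(set(parent_a) | set(parent_b))
    child = set()
    while pool and len(child) < n_keep:
        # Small epsilon keeps the total weight strictly positive.
        probs = [weights.get(f, 0.0) + 1e-12 for f in pool]
        pick = rng.choices(pool, weights=probs, k=1)[0]
        child.add(pick)
        pool.remove(pick)
    return frozenset(child)


def select_features(n_features, evaluate, pop_size=20, generations=10,
                    shrink=0.8, seed=0):
    """Minimise evaluate(subset)[0] over feature subsets (pop_size >= 2).

    `evaluate` is a hypothetical callback returning (score, weights),
    where `weights` maps each feature in the subset to a non-negative
    importance. Lower scores are better.
    """
    rng = random.Random(seed)
    size = max(1, n_features // 2)
    population = [frozenset(rng.sample(range(n_features), size))
                  for _ in range(pop_size)]
    best_score, best_subset = float("inf"), None

    for gen in range(generations):
        scored = []
        for subset in population:
            score, weights = evaluate(subset)
            scored.append((score, subset, weights))
        scored.sort(key=lambda t: t[0])
        if scored[0][0] < best_score:
            best_score, best_subset = scored[0][0], scored[0][1]
        if gen == generations - 1:
            break  # no need to breed after the last evaluation

        # Keep the better half as parents, then shrink the subset size so
        # that low-weight features are progressively eliminated.
        parents = scored[: max(2, pop_size // 2)]
        size = max(1, int(size * shrink))
        population = []
        while len(population) < pop_size:
            (_, a, wa), (_, b, wb) = rng.sample(parents, 2)
            merged = {f: max(wa.get(f, 0.0), wb.get(f, 0.0)) for f in a | b}
            population.append(weighted_crossover(a, b, merged, size, rng))
    return best_subset, best_score


def toy_evaluate(subset):
    """Placeholder objective (NOT the risk bound): features 0, 1, 2 are
    'relevant'; missing ones are penalised, plus a small size penalty."""
    relevant = {0, 1, 2}
    score = len(relevant - subset) + 0.01 * len(subset)
    weights = {f: (1.0 if f in relevant else 0.1) for f in subset}
    return score, weights


subset, score = select_features(10, toy_evaluate)
print(sorted(subset), score)
```

In a real setting, `evaluate` would train the classifier on the candidate subset and return the value of the bound together with the learned feature weights. Reducing the subset size each generation is only one simple way to mimic recursive elimination.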