Markov blanket feature selection, while theoretically optimal, is generally challenging to implement. This is due to the shortcomings of existing approaches to conditional independence (CI) testing, which tend to struggle with either the curse of dimensionality or computational complexity. We propose a novel two-step approach that facilitates Markov blanket feature selection in high dimensions. First, neural networks are used to map features to low-dimensional representations. In the second step, CI testing is performed by applying the k-NN conditional mutual information estimator to the learned feature maps. The mappings are designed to ensure that mapped samples both preserve information and share similar information about the target variable if and only if they are close in Euclidean distance. We show that these properties boost the performance of the k-NN estimator in the second step. The performance of the proposed method is evaluated on synthetic data, as well as real data pertaining to datacenter hard disk drive failures.
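The abstract does not spell out the estimator, but the k-NN conditional mutual information estimator it refers to is typically of the Frenzel-Pompe/KSG family. As a minimal illustrative sketch (not the paper's implementation, and with hypothetical parameter choices such as k=5), one can estimate I(X; Y | Z) from samples by comparing neighbor counts in the joint and marginal spaces:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """Frenzel-Pompe k-NN estimate of I(X; Y | Z) in nats.

    x, y, z: arrays of shape (n, d). In the setting described above,
    these would be the learned low-dimensional feature maps.
    """
    xyz = np.hstack([x, y, z])
    xz = np.hstack([x, z])
    yz = np.hstack([y, z])
    # Distance to the k-th nearest neighbor in the joint space
    # (Chebyshev metric, excluding the point itself).
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    t_xz, t_yz, t_z = cKDTree(xz), cKDTree(yz), cKDTree(z)
    # Count neighbors strictly inside the radius in each marginal
    # space; subtract 1 to exclude the query point itself.
    n_xz = np.array([len(t_xz.query_ball_point(p, r - 1e-12, p=np.inf)) - 1
                     for p, r in zip(xz, eps)])
    n_yz = np.array([len(t_yz.query_ball_point(p, r - 1e-12, p=np.inf)) - 1
                     for p, r in zip(yz, eps)])
    n_z = np.array([len(t_z.query_ball_point(p, r - 1e-12, p=np.inf)) - 1
                    for p, r in zip(z, eps)])
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1)
                                - digamma(n_z + 1))
```

A CI test can then threshold this estimate: on a Markov chain X → Z → Y the estimate should be near zero, while a direct X → Y dependence given Z yields a clearly positive value. The abstract's point is that the estimator's accuracy degrades in high dimensions, which is why it is applied to the learned low-dimensional maps rather than the raw features.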