In many application settings, the data are plagued with missing features, which hinder data analysis. An abundant literature addresses missing values in an inferential framework, where the aim is to estimate parameters and their variance from incomplete tables. Here, we consider supervised-learning settings where the objective is to best predict a target when missing values appear in both the training and test sets. We analyze which missing-values strategies lead to good prediction, and we show the consistency of two approaches to estimating the prediction function. The most striking result is that the widely used method of mean imputation prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is known to have serious drawbacks because it deforms the joint and marginal distributions of the data. That such a simple approach can be consistent has important consequences in practice. The result holds asymptotically, provided the learning algorithm is itself consistent. We contribute additional analysis on decision trees, as they can naturally tackle empirical risk minimization with missing values thanks to their ability to handle the half-discrete nature of incompletely observed variables. After comparing different missing-values strategies in trees both theoretically and empirically, we recommend the missing incorporated in attribute (MIA) method, as it can handle both informative and non-informative missing values.
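The mean-imputation-then-learn pipeline discussed above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: function names are ours, and the key point is only that the column means are estimated on the training set and reused at test time, so the same fixed values fill missing entries in both sets.

```python
import numpy as np

def fit_mean_imputer(X_train):
    """Column means computed on the observed (non-NaN) training entries only."""
    return np.nanmean(X_train, axis=0)

def impute(X, col_means):
    """Replace each NaN with the training-set mean of its column."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Toy data with missing entries (NaN encodes "missing")
X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])
col_means = fit_mean_imputer(X_train)        # array([2., 6.])

# The same train-set means are applied to the test set,
# after which any consistent learner can be fit and evaluated.
X_test = np.array([[np.nan, 5.0]])
print(impute(X_test, col_means))             # [[2. 5.]]
```

In practice the same pattern is available off the shelf, e.g. via scikit-learn's `SimpleImputer(strategy="mean")` fitted on the training data inside a pipeline.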