Cross-validation is the de facto standard for model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of preprocessing, such as mean-centering, rescaling, dimensionality reduction and outlier removal, prior to cross-validation. It is widely believed that such preprocessing stages, if done in an unsupervised manner that does not involve the class labels or response values, have no effect on the validity of cross-validation. In this paper, we show that this belief is not true. Preliminary unsupervised preprocessing can introduce either a positive or negative bias into the estimates of model performance. Thus, it may lead to invalid inference and sub-optimal choices of model parameters. In light of this, the scientific community should re-examine the use of preprocessing prior to cross-validation across the various application domains. By default, the parameters of all data-dependent transformations should be learned only from the training samples.
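The recommendation above, learning the parameters of data-dependent transformations only from the training samples, can be illustrated with a minimal sketch. The example below is illustrative, not from the paper: it uses synthetic data, takes standardization (mean-centering and rescaling) as the preprocessing step, and the helper `kfold_indices` is a hypothetical name. The key point is that the mean and standard deviation are computed on each training fold alone and then applied, unchanged, to the held-out fold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # synthetic features, for illustration only
y = (X[:, 0] > 0).astype(int)          # synthetic labels

def kfold_indices(n, k):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in kfold_indices(len(X), 5):
    # Correct: standardization parameters come from the training fold only.
    mu = X[train].mean(axis=0)
    sigma = X[train].std(axis=0)
    X_train = (X[train] - mu) / sigma
    # The held-out fold is transformed with the training-fold parameters,
    # never with statistics computed on the full data set.
    X_test = (X[test] - mu) / sigma
    # ... fit and evaluate the model on X_train / X_test here ...
```

By contrast, the leaky variant the paper warns about would call `X.mean(axis=0)` and `X.std(axis=0)` once on the full data set before splitting, letting each test fold influence its own preprocessing.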