Cross-validation is the de facto standard for model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of preprocessing, such as mean-centering, rescaling, dimensionality reduction and outlier removal, prior to cross-validation. It is widely believed that such preprocessing stages, if done in an unsupervised manner that does not involve the class labels or response values, have no effect on the validity of cross-validation. In this paper, we show that this belief is not true. Preliminary unsupervised preprocessing can introduce either a positive or negative bias into the estimates of model performance. Thus, it may lead to invalid inference and sub-optimal choices of model parameters. In light of this, the scientific community should re-examine the use of preprocessing prior to cross-validation across the various application domains. By default, the parameters of all data-dependent transformations should be learned only from the training samples.
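The recommendation above, learning the parameters of data-dependent transformations only from the training samples, can be illustrated with a minimal sketch. The example below is illustrative, not from the paper: it uses synthetic data, takes standardization (mean-centering and rescaling) as the preprocessing step, and the helper `kfold_indices` is a hypothetical name. The key point is that the mean and standard deviation are computed on each training fold alone and then applied, unchanged, to the held-out fold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # synthetic features, for illustration only
y = (X[:, 0] > 0).astype(int)          # synthetic labels

def kfold_indices(n, k):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in kfold_indices(len(X), 5):
    # Correct: standardization parameters come from the training fold only.
    mu = X[train].mean(axis=0)
    sigma = X[train].std(axis=0)
    X_train = (X[train] - mu) / sigma
    # The held-out fold is transformed with the training-fold parameters,
    # never with statistics computed on the full data set.
    X_test = (X[test] - mu) / sigma
    # ... fit and evaluate the model on X_train / X_test here ...
```

By contrast, the leaky variant the paper warns about would call `X.mean(axis=0)` and `X.std(axis=0)` once on the full data set before splitting, letting each test fold influence its own preprocessing.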