积极的标签清理活动:在资源制约下提高数据集的质量 (Active label cleaning: Improving dataset quality under resource constraints) - 专知论文

会员服务 ·

0

标注 · 数据集 · MoDELS · 噪声 · Better ·

2021 年 9 月 1 日

Active label cleaning: Improving dataset quality under resource constraints

翻译：积极的标签清理活动:在资源制约下提高数据集的质量

Melanie Bernhardt,Daniel C. Castro,Ryutaro Tanno,Anton Schwaighofer,Kerem C. Tezcan,Miguel Monteiro,Shruthi Bannur,Matthew Lungren,Aditya Nori,Ben Glocker,Javier Alvarez-Valle,Ozan Oktay

from arxiv, Currently under peer-review

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation - which we term "active label cleaning". We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to 4 times more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.

翻译：称为标签噪声的数据注释中的缺陷不利于对机器学习模型的培训,而且对模型性能的评估具有经常被人们忽视的混乱影响。然而,在资源紧张的环境中,如医疗保健等,聘请专家通过充分重新点注大型数据集来清除标签噪音是行不通的。这项工作提倡以数据驱动的方式优先点评重新点的样本,我们称之为“活跃标签清洁”。我们建议根据估计的标签正确性和标签难度来排列实例,并采用模拟框架来评估重新标注效果。我们在自然图像和新的医学成像基准方面的实验表明,清洁噪音标签会减轻其对模型培训、评估和选择的负面影响。很显然,拟议的积极标签清理使得在现实条件下纠正标签的效果比典型随机选择要高4倍,从而更好地利用专家的宝贵时间来改进数据集的质量。

0

相关内容

【KDD2021】图神经网络，NUS- Xavier Bresson教授

【KDD2021】图神经网络，NUS- Xavier Bresson教授

专知会员服务

66+阅读 · 2021年8月20日

威斯康辛大学《机器学习导论》2020秋季课程完结，课件、视频资源已开放

专知会员服务

16+阅读 · 2020年12月25日

因果图，Causal Graphs，52页ppt

因果图，Causal Graphs，52页ppt

专知会员服务

250+阅读 · 2020年4月19日

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

专知会员服务

26+阅读 · 2020年4月2日

【预测天气】使用深度学习改进天气预报的进展和挑战，60页ppt，Progress and challenges for the use of deep learning to improve weather forecasts，Peter Dueben

【预测天气】使用深度学习改进天气预报的进展和挑战，60页ppt，Progress and challenges for the use of deep learning to improve weather forecasts，Peter Dueben

专知会员服务

55+阅读 · 2020年3月14日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【Google AI】开源NoisyStudent：自监督图像分类

【Google AI】开源NoisyStudent：自监督图像分类

专知会员服务

55+阅读 · 2020年2月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

PyTorch Parallel Training（单机多卡并行、混合精度、同步BN训练指南文档）

PyTorch Parallel Training（单机多卡并行、混合精度、同步BN训练指南文档）

CVer

21+阅读 · 2020年6月20日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

carla 学习笔记

carla 学习笔记

CreateAMind

9+阅读 · 2018年2月7日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

Some like it tough: Improving model generalization via progressively increasing the training difficulty

Arxiv

0+阅读 · 2021年10月25日

Statistical Learning from Biased Training Samples

Arxiv

0+阅读 · 2021年10月24日

Adjusting for unobserved confounding using large-scale propensity scores

Arxiv

0+阅读 · 2021年10月23日

PropMix: Hard Sample Filtering and Proportional MixUp for Learning with Noisy Labels

PropMix: Hard Sample Filtering and Proportional MixUp for Learning with Noisy Labels

Arxiv

0+阅读 · 2021年10月22日

Generalized Results for the Existence and Consistency of the MLE in the Bradley-Terry-Luce Model

Arxiv

0+阅读 · 2021年10月21日

Early-Learning Regularization Prevents Memorization of Noisy Labels

Early-Learning Regularization Prevents Memorization of Noisy Labels

Arxiv

3+阅读 · 2020年6月30日

Unsupervised Data Augmentation for Consistency Training

Arxiv

5+阅读 · 2019年7月10日

Active Generative Adversarial Network for Image Classification

Arxiv

4+阅读 · 2019年6月17日

Fast Interactive Image Retrieval using large-scale unlabeled data

Arxiv

4+阅读 · 2018年2月12日

Active Learning from Positive and Unlabeled Data

Arxiv

3+阅读 · 2016年2月24日

VIP会员

文章信息

相关主题

相关VIP内容

【KDD2021】图神经网络，NUS- Xavier Bresson教授

【KDD2021】图神经网络，NUS- Xavier Bresson教授

专知会员服务

66+阅读 · 2021年8月20日

威斯康辛大学《机器学习导论》2020秋季课程完结，课件、视频资源已开放

专知会员服务

16+阅读 · 2020年12月25日

因果图，Causal Graphs，52页ppt

因果图，Causal Graphs，52页ppt

专知会员服务

250+阅读 · 2020年4月19日

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

专知会员服务

26+阅读 · 2020年4月2日

【预测天气】使用深度学习改进天气预报的进展和挑战，60页ppt，Progress and challenges for the use of deep learning to improve weather forecasts，Peter Dueben

【预测天气】使用深度学习改进天气预报的进展和挑战，60页ppt，Progress and challenges for the use of deep learning to improve weather forecasts，Peter Dueben

专知会员服务

55+阅读 · 2020年3月14日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【Google AI】开源NoisyStudent：自监督图像分类

【Google AI】开源NoisyStudent：自监督图像分类

专知会员服务

55+阅读 · 2020年2月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

操作系统智能体：基于多模态大模型（MLLM）的通用计算设备智能体综述

《美国太空军系统全生命周期建模、仿真与分析效能提升方案》最新84页报告

【博士论文】推进数据高效的深度学习：非参数 Transformer、主动测试与上下文学习

自主人工智能：未来战争是否将是自主化的？

相关资讯

PyTorch Parallel Training（单机多卡并行、混合精度、同步BN训练指南文档）

PyTorch Parallel Training（单机多卡并行、混合精度、同步BN训练指南文档）

CVer

21+阅读 · 2020年6月20日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

carla 学习笔记

carla 学习笔记

CreateAMind

9+阅读 · 2018年2月7日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

相关论文

Some like it tough: Improving model generalization via progressively increasing the training difficulty

Arxiv

0+阅读 · 2021年10月25日

Statistical Learning from Biased Training Samples

Arxiv

0+阅读 · 2021年10月24日

Adjusting for unobserved confounding using large-scale propensity scores

Arxiv

0+阅读 · 2021年10月23日

PropMix: Hard Sample Filtering and Proportional MixUp for Learning with Noisy Labels

PropMix: Hard Sample Filtering and Proportional MixUp for Learning with Noisy Labels

Arxiv

0+阅读 · 2021年10月22日

Generalized Results for the Existence and Consistency of the MLE in the Bradley-Terry-Luce Model

Arxiv

0+阅读 · 2021年10月21日

Early-Learning Regularization Prevents Memorization of Noisy Labels

Early-Learning Regularization Prevents Memorization of Noisy Labels

Arxiv

3+阅读 · 2020年6月30日

Unsupervised Data Augmentation for Consistency Training

Arxiv

5+阅读 · 2019年7月10日

Active Generative Adversarial Network for Image Classification

Arxiv

4+阅读 · 2019年6月17日

Fast Interactive Image Retrieval using large-scale unlabeled data

Arxiv

4+阅读 · 2018年2月12日

Active Learning from Positive and Unlabeled Data

Arxiv

3+阅读 · 2016年2月24日

微信扫码咨询专知VIP会员