Label noise in datasets can significantly degrade the performance and robustness of deep neural networks (DNNs) trained on them. As modern DNNs grow in size, so does the demand for automated tools that detect such errors. In this paper, we propose post-hoc, model-agnostic methods for noise detection and rectification that use the penultimate-layer features of a DNN. Our idea rests on the observation that the penultimate feature of a mislabeled data point is more similar to the features of data points from its true class than to those from other classes. As a result, the probability of a label occurring within a tight cluster of similar points is informative for detecting and rectifying errors. Through theoretical and empirical analyses, we demonstrate that our approach achieves high detection performance across diverse, realistic noise scenarios and can automatically rectify these errors to improve dataset quality. Our implementation is available at https://anonymous.4open.science/r/noise-detection-and-rectification-AD8E.
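The observation above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, cosine similarity as the similarity measure, the neighbourhood size `k`, and the decision `threshold` are all assumptions made for illustration. Given penultimate-layer features that have already been extracted, the sketch estimates for each point the label distribution among its most similar neighbours, flags the point when its given label is improbable there, and proposes the most frequent neighbour label as the correction.

```python
import numpy as np


def knn_label_probs(features, labels, num_classes, k=5):
    """Label distribution among each point's k most similar neighbours.

    Similarity is cosine similarity on the given features (an assumption;
    the source only speaks of "similarity" between penultimate features).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    probs = np.zeros((len(labels), num_classes))
    for c in range(num_classes):
        probs[:, c] = (labels[neighbours] == c).mean(axis=1)
    return probs


def detect_and_rectify(features, labels, num_classes, k=5, threshold=0.5):
    """Flag labels that are improbable in their neighbourhood and propose fixes.

    `threshold` is a hypothetical cut-off on the neighbourhood probability
    of the given label; it is not taken from the source.
    """
    probs = knn_label_probs(features, labels, num_classes, k)
    given_prob = probs[np.arange(len(labels)), labels]
    suspect = given_prob < threshold
    rectified = labels.copy()
    rectified[suspect] = probs[suspect].argmax(axis=1)
    return suspect, rectified
```

In practice `features` would be the penultimate-layer activations of a trained DNN and `labels` the possibly noisy integer class labels. Both `k` and `threshold` would need tuning per dataset.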