Enhancement Encoding: A Novel Imbalanced Classification Approach via Encoding the Training Labels

Class imbalance, also known as a long-tailed distribution, is a common problem in machine-learning classification tasks. When it occurs, minority-class data are overwhelmed by the majority classes, which poses a considerable challenge for data science. To address class imbalance, researchers have proposed many methods: some rebalance the data set (SMOTE), others refine the loss function (Focal Loss), and some have observed that the value of labels influences class-imbalanced learning (Yang and Xu. Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS 2020). However, no one has yet changed the way the labels of the data are encoded. Today, the prevailing label-encoding technique is one-hot encoding, owing to its good performance in the general case. It is not a good choice for imbalanced data, however, because the classifier treats majority and minority samples equally. In this paper, we propose enhancement encoding, a technique designed specifically for imbalanced classification. Enhancement encoding combines re-weighting and cost-sensitivity, so that it can reflect the difference between hard and easy (or minority and majority) classes. To reduce the number of validation samples and the computation cost, we also replace the confusion matrix with a novel soft-confusion matrix, which works better with a small validation set. In the experiments, we evaluate enhancement encoding with three different types of loss. The results show that enhancement encoding is very effective at improving the performance of networks trained on imbalanced data; in particular, performance on minority classes is much better.

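
The abstract does not define the soft-confusion matrix, so the following is only an illustrative sketch under an assumed definition: row i of the soft matrix is taken to be the mean predicted probability vector over validation samples whose true class is i, whereas the ordinary confusion matrix counts hard argmax decisions. The function names (one_hot, hard_confusion, soft_confusion) and the toy data are hypothetical and are not taken from the paper. The sketch also shows why one-hot targets carry no information about class frequency or difficulty.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Standard one-hot encoding: every class gets the same kind of unit target."""
    return np.eye(num_classes)[np.asarray(labels)]

def hard_confusion(labels, probs, num_classes):
    """Row-normalised confusion matrix built from argmax predictions."""
    labels = np.asarray(labels)
    cm = np.zeros((num_classes, num_classes))
    # Count one (true class, predicted class) pair per sample.
    np.add.at(cm, (labels, probs.argmax(axis=1)), 1.0)
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1.0)

def soft_confusion(labels, probs, num_classes):
    """ASSUMED definition: row i is the mean probability vector of true class i."""
    labels = np.asarray(labels)
    sums = one_hot(labels, num_classes).T @ probs  # per-class sum of probabilities
    counts = np.bincount(labels, minlength=num_classes)[:, None]
    return sums / np.maximum(counts, 1)

# Toy imbalanced validation set: three majority samples (class 0), one minority (class 1).
labels = [0, 0, 0, 1]
probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.60, 0.40],
    [0.45, 0.55],
])

hard = hard_confusion(labels, probs, num_classes=2)
soft = soft_confusion(labels, probs, num_classes=2)
print("one-hot targets:\n", one_hot(labels, 2))
print("hard confusion:\n", hard)
print("soft confusion:\n", soft)
```

With a single minority sample, the hard matrix can only report that the sample was classified correctly or incorrectly, while the soft version under this assumption retains how confident the classifier was. This is one plausible reading of why a soft variant might behave better on a small validation set; the paper's actual construction may differ.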