Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
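The abstract only names the class-imbalanced margin loss without giving its form. As a rough illustration, the sketch below assumes a ramp-style $\rho$-margin surrogate with class-dependent confidence margins in the binary setting; the function names, the $\{-1,+1\}$ label encoding, and the margin values `rho_pos`, `rho_neg` are assumptions made for this example, not the paper's exact IMMAX objective or tuned parameters.

```python
import numpy as np

def rho_margin_loss(margins: np.ndarray, rho: np.ndarray) -> np.ndarray:
    # Ramp (rho-margin) loss: 1 when the margin is <= 0, decreasing
    # linearly on (0, rho), and 0 once the margin exceeds rho.
    return np.clip(1.0 - margins / rho, 0.0, 1.0)

def class_imbalanced_margin_loss(scores: np.ndarray, labels: np.ndarray,
                                 rho_pos: float = 1.0,
                                 rho_neg: float = 0.25) -> np.ndarray:
    # Binary labels in {-1, +1}; each class gets its own confidence
    # margin, so one class (e.g. the minority) can be required to be
    # classified with a larger margin than the other. The specific
    # margin values here are illustrative placeholders.
    rho = np.where(labels > 0, rho_pos, rho_neg)
    return rho_margin_loss(labels * scores, rho)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=8)               # predictions h(x) for 8 points
    labels = rng.choice([-1.0, 1.0], size=8)  # binary labels
    print(class_imbalanced_margin_loss(scores, labels).mean())
```

Allowing the two classes to carry different confidence margins is what distinguishes such a loss from a standard margin loss: the empirical objective can then demand higher-confidence predictions on the rarer class, which is the intuition behind margin-maximization approaches to imbalanced data.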