Class imbalance is a substantial challenge in many real-world classification problems. Synthetic over-sampling methods have been effective in improving classifier performance on imbalanced problems. However, most of these methods generate non-diverse synthetic instances within the convex hull formed by the existing minority instances, because they concentrate only on the minority class and ignore the extensive information provided by the majority class. They also often perform poorly on extremely imbalanced data: the fewer the minority instances, the less information is available for generating synthetic instances. Moreover, existing methods that generate synthetic instances using distributional information from the majority class are not effective when the majority class has a multi-modal distribution. We propose a new method, Synthetic Over-sampling with Minority and Majority classes (SOMM), that generates diverse and adaptable synthetic instances. SOMM generates synthetic instances diversely within the minority data space and then adaptively updates them according to their neighbourhood, which includes both classes. As a result, SOMM performs well on both binary and multiclass imbalance problems. We evaluate SOMM on binary and multiclass problems using benchmark data sets at different imbalance levels. The empirical results show that SOMM outperforms other existing methods.
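To make the convex-hull limitation concrete, the following is a minimal sketch of conventional SMOTE-style interpolation between minority neighbours. It is not SOMM, whose update rule the abstract does not specify. The function name `interpolate_minority` and its parameters `n_new`, `k`, and `seed` are illustrative assumptions, and the sketch assumes `k` is smaller than the number of minority instances.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=0):
    """Hypothetical SMOTE-style sampler: each synthetic point lies on the
    segment between a minority instance and one of its k nearest
    minority neighbours, so it never leaves the minority convex hull."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise Euclidean distances among minority instances only;
    # the majority class is ignored entirely.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base instance and one of its k neighbours per sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position along the segment from base to neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority instances at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = interpolate_minority(X_min, n_new=50)
```

Because every output is a convex combination of two existing minority points, all 50 synthetic instances here stay inside the unit square. With very few minority instances the reachable region shrinks accordingly, which is the lack of diversity the abstract describes.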