Clustering approaches that utilize convex loss functions have recently attracted growing interest in the formation of compact data clusters. Although classical methods like k-means and its wide family of variants are still widely used, all of them require the number of clusters k to be supplied as input, and many are notably sensitive to initialization. Convex clustering provides a more stable alternative by formulating the clustering task as a convex optimization problem, ensuring a unique global solution. However, it faces challenges in handling high-dimensional data, especially in the presence of noise and outliers. Additionally, strong fusion regularization, controlled by the tuning parameter, can hinder effective cluster formation within a convex clustering framework. To overcome these challenges, we introduce a robust approach that integrates convex clustering with the Median of Means (MoM) estimator, thus developing an outlier-resistant and efficient clustering framework that does not necessitate prior knowledge of the number of clusters. By leveraging the robustness of MoM alongside the stability of convex clustering, our method enhances both performance and efficiency, especially on large-scale datasets. Theoretical analysis demonstrates weak consistency under specific conditions, while experiments on synthetic and real-world datasets validate the method's superior performance compared to existing approaches.
翻译:利用凸损失函数的聚类方法近年来在形成紧凑数据簇方面引起了日益增长的关注。尽管经典方法如k-means及其广泛的变体仍被广泛使用,但它们均需预先提供聚类数量k作为输入,且许多方法对初始化条件极为敏感。凸聚类通过将聚类任务表述为凸优化问题,确保了唯一的全局解,从而提供了更稳定的替代方案。然而,该方法在处理高维数据时面临挑战,尤其是在存在噪声和异常值的情况下。此外,由调谐参数控制的强融合正则化可能阻碍凸聚类框架内有效簇的形成。为克服这些挑战,我们提出了一种鲁棒方法,将凸聚类与均值中位数(MoM)估计器相结合,从而构建了一个无需预先知晓聚类数量、且对异常值具有抵抗力的高效聚类框架。通过结合MoM的鲁棒性与凸聚类的稳定性,我们的方法在性能和效率上均得到提升,尤其适用于大规模数据集。理论分析表明,在特定条件下该方法具有弱一致性,而在合成与真实数据集上的实验验证了其相较于现有方法的优越性能。