Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This paper introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for $k$-means clustering. Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm when clustering a large number of clusters, and a $k$-means algorithm initialized by the algorithm outperforms many of the classic clustering methods in both speed and accuracy, while scaling well to large datasets such as ImageNet.
翻译:集束是一种广泛使用的技术,在多个领域都有悠久而丰富的历史。然而,大多数现有的算法并不适合大型数据集,或者缺少理论的汇合保证。本文介绍了一种基于损失最小化的、在高斯混合模型和外部线上运行良好的、可被证实可靠的集束算法。它提供了理论保证,在某种假设下,算法获得的高度精度和概率很高。此外,它也可以用作美元手段集集的初始化战略。 实际世界大型数据集实验表明,当将大量组集组合在一起时,算法的有效性,而由算法开始的美元手段算法在速度和准确性上都超过了许多典型的集束方法,同时推广到象图像网这样的大型数据集。</s>