An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations

K-Means is one of the most widely used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application, it is well known to suffer from several disadvantages: it is only able to find local minima, and the positions of the initial cluster centres (centroids) can greatly affect the clustering solution. Over the years, many K-Means variations and initialisation techniques have been proposed, with differing degrees of complexity. In this study we focus on common K-Means variations along with a range of deterministic and stochastic initialisation techniques. We show that, on average, more sophisticated initialisation techniques alleviate the need for complex clustering methods. Furthermore, deterministic methods perform better than stochastic methods. However, there is a trade-off: less sophisticated stochastic methods, executed multiple times, can result in better clustering. Once execution time is factored in, deterministic methods can be competitive and yield a good clustering solution. These conclusions are obtained through extensive benchmarking using a range of synthetic model generators and real-world data sets.
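To make the trade-off between repeated stochastic runs and a single deterministic run concrete, the following is a minimal sketch and not the benchmarking setup of the study. It assumes plain Lloyd-style K-Means, uniform random sampling of data points as the stochastic initialiser, and a simple farthest-point (maximin) rule seeded at the point nearest the data mean as the deterministic initialiser. The helper names (`lloyd`, `random_init`, `maximin_init`, `make_blobs`) and the toy data are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations; returns (centroids, sum of squared errors)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        new = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def random_init(points, k, rng):
    """Stochastic initialiser (assumed): k distinct data points chosen uniformly."""
    return rng.sample(points, k)


def maximin_init(points, k):
    """Deterministic initialiser (assumed): start at the point nearest the
    data mean, then repeatedly add the point farthest from those chosen."""
    mean = tuple(sum(col) / len(points) for col in zip(*points))
    chosen = [min(points, key=lambda p: sq_dist(p, mean))]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return chosen


def make_blobs(rng, centres, n_per, spread):
    """Toy synthetic data: isotropic Gaussian blobs around the given centres."""
    return [tuple(rng.gauss(m, spread) for m in c) for c in centres for _ in range(n_per)]


rng = random.Random(0)
points = make_blobs(rng, [(0, 0), (5, 5), (0, 5), (5, 0)], n_per=50, spread=0.6)

# One deterministic run versus twenty stochastic restarts.
_, sse_det = lloyd(points, maximin_init(points, 4))
sse_runs = [lloyd(points, random_init(points, 4, rng))[1] for _ in range(20)]

print("deterministic SSE:", round(sse_det, 2))
print("stochastic SSE, best / mean:",
      round(min(sse_runs), 2), "/", round(sum(sse_runs) / len(sse_runs), 2))
```

The deterministic initialiser needs a single run, whereas the stochastic one is typically judged by the best of several restarts, so a fair comparison would also weigh the total execution time of those restarts.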
