Statistical power for cluster analysis

Cluster algorithms are gaining in popularity because of their compelling ability to identify discrete subgroups in data and their increasing accessibility in mainstream programming languages and statistical software. While researchers can follow guidelines to choose the right algorithm and to determine what constitutes convincing clustering, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we take a simulation approach to estimate power and classification accuracy for popular analysis pipelines. We systematically varied cluster size, number of clusters, number of features that differed between clusters, effect size within each differing feature, and cluster covariance structure in generated datasets. We then subjected these datasets to common dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means; hierarchical agglomerative clustering with Ward linkage and Euclidean distance, or with average linkage and cosine distance; or HDBSCAN). Furthermore, we simulated additional datasets to explore the effect of sample size and cluster separation on statistical power and classification accuracy. We found that clustering outcomes were driven by large effect sizes or by the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power can be achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large (Δ=4). Finally, we discuss whether fuzzy clustering (c-means) could provide a more parsimonious alternative for identifying separable multivariate normal distributions, particularly those with lower centroid separation.
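The simulation approach described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only the Python standard library, a plain two-cluster k-means, and no dimensionality reduction. Several choices are assumptions made for illustration and are not stated in the abstract: Δ is taken to be the distance between cluster centroids in units of the within-cluster standard deviation, clustering is counted as "detected" when the mean silhouette coefficient reaches 0.5, and all function names (`simulate`, `two_means`, `silhouette`, `estimate_power`) are hypothetical.

```python
import math
import random


def simulate(n_per_group, delta, n_features, rng):
    """Two spherical, unit-variance Gaussian subgroups with centroids delta apart."""
    # Assumption: the separation is spread evenly over all features.
    shift = delta / math.sqrt(n_features)
    points, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            points.append([rng.gauss(group * shift, 1.0) for _ in range(n_features)])
            labels.append(group)
    return points, labels


def two_means(points, rng, n_iter=50):
    """Plain Lloyd's k-means with k=2; returns a 0/1 assignment per point."""
    centers = rng.sample(points, 2)  # initialise on two random observations
    assign = None
    for _ in range(n_iter):
        new_assign = [min((0, 1), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_assign == assign:
            break  # assignments are stable, so the algorithm has converged
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def silhouette(points, assign):
    """Mean silhouette coefficient for a two-cluster solution."""
    if len(set(assign)) < 2:
        return 0.0  # only one cluster found: no evidence of subgroups
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and assign[j] == assign[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if assign[j] != assign[i]]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = sum(own) / len(own)      # mean distance within own cluster
        b = sum(other) / len(other)  # mean distance to the other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def estimate_power(n_per_group, delta, n_features=2, n_sims=200,
                   threshold=0.5, seed=1):
    """Return (share of simulations with silhouette >= threshold, mean accuracy)."""
    rng = random.Random(seed)
    hits, accuracy = 0, 0.0
    for _ in range(n_sims):
        points, labels = simulate(n_per_group, delta, n_features, rng)
        assign = two_means(points, rng)
        hits += int(silhouette(points, assign) >= threshold)
        agree = sum(a == t for a, t in zip(assign, labels)) / len(labels)
        accuracy += max(agree, 1.0 - agree)  # cluster labels are arbitrary
    return hits / n_sims, accuracy / n_sims


if __name__ == "__main__":
    for delta in (1.0, 2.0, 3.0, 4.0):
        power, acc = estimate_power(n_per_group=20, delta=delta)
        print(f"delta={delta}: power={power:.2f}, accuracy={acc:.2f}")
```

Running the script prints one estimate per separation for N=20 per subgroup. The exact figures depend on the silhouette threshold, the number of features, and the seed, so they should be read as illustrative rather than as a reproduction of the reported results.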
