From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results. To detect reliable clusters, we capitalize on recent advances in network science to unite sparsification and clustering with principled model selection. We test two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle for model selection. In synthetic data, they outperform traditional approaches, detecting planted clusters under high-noise conditions and with fewer samples. Compared to WGCNA on gene co-expression data, the Regularized Map Equation identifies more robust and functionally coherent gene modules. Our results establish Bayesian community detection as a principled and noise-resistant framework for uncovering modular structure in high-dimensional data across fields.
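The threshold sensitivity described above can be illustrated with a minimal sketch. It is not the authors' pipeline and uses neither the Degree-Corrected Stochastic Block Model nor the Regularized Map Equation. It only shows how the network produced by hard-thresholding a correlation matrix depends on the chosen cutoff. The helper names (`planted_correlation`, `sparsify`, `count_components`), the synthetic data model, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def planted_correlation(n_clusters=3, size=10, n_samples=50, noise=1.0, seed=0):
    """Hypothetical synthetic data: each variable is its cluster's latent signal plus noise."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((n_clusters, n_samples))
    labels = np.repeat(np.arange(n_clusters), size)
    data = latent[labels] + noise * rng.standard_normal((labels.size, n_samples))
    # Rows are variables, so corrcoef returns a variable-by-variable matrix.
    return np.corrcoef(data), labels


def sparsify(corr, threshold):
    """Keep an edge wherever |correlation| >= threshold; drop self-loops."""
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)
    return adj


def count_components(adj):
    """Count connected components of a boolean adjacency matrix by depth-first search."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adj[node] & ~seen):
                seen[neighbor] = True
                stack.append(int(neighbor))
    return count


if __name__ == "__main__":
    corr, labels = planted_correlation()
    # The thresholds below are arbitrary, which is exactly the problem being illustrated.
    for threshold in (0.2, 0.4, 0.6, 0.8):
        adj = sparsify(corr, threshold)
        n_edges = int(adj.sum()) // 2
        print(threshold, n_edges, count_components(adj))
```

Running the loop at several cutoffs shows the edge count and the number of connected components changing with the threshold even though the planted structure is fixed. That is the arbitrariness that a model-selection criterion such as Minimum Description Length is meant to remove.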