Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric

Clustering short text embeddings is a foundational task in natural language processing, yet it remains challenging because most algorithms require the number of clusters to be specified in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to scale efficiently to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and the Cohesion Ratio for unsupervised organization and evaluation of short text data. An implementation of our estimator of k and of the Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
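
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the released implementation. It assumes a plain eigengap heuristic on the symmetric normalized Laplacian of a cosine-similarity graph as a stand-in for the spectral estimator, and it assumes the Cohesion Ratio can be approximated as the mean intra-cluster cosine similarity divided by the mean similarity over all pairs; the exact definitions used in the paper may differ. The function names `estimate_k_eigengap` and `cohesion_ratio` are hypothetical, and the adaptive sampling strategy is not reproduced here, so the dense n-by-n similarity matrix limits this sketch to small inputs or to a subsample.

```python
import numpy as np


def _unit_rows(X):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)


def estimate_k_eigengap(X, max_k=20):
    """Hypothetical estimator: pick k at the largest gap among the smallest
    eigenvalues of the normalized Laplacian (an assumed stand-in for the
    paper's spectral estimator)."""
    Xn = _unit_rows(X)
    # Cosine similarity graph; negative similarities are clipped to zero
    # (an assumption) so that all edge weights are non-negative.
    S = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(S, 0.0)
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} S D^{-1/2}
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)[: max_k + 1]  # ascending order
    gaps = np.diff(eigvals)
    # A large gap after the k-th smallest eigenvalue suggests k clusters.
    return int(np.argmax(gaps)) + 1


def cohesion_ratio(X, labels):
    """Assumed form of the metric: mean intra-cluster cosine similarity
    divided by the mean cosine similarity over all distinct pairs."""
    Xn = _unit_rows(X)
    S = Xn @ Xn.T
    labels = np.asarray(labels)
    off_diag = ~np.eye(len(labels), dtype=bool)
    same_cluster = (labels[:, None] == labels[None, :]) & off_diag
    intra = S[same_cluster].mean()
    background = S[off_diag].mean()
    return intra / background
```

Under these assumptions, a ratio well above 1 indicates that points share more similarity with their own cluster than with the corpus as a whole, while a ratio near 1 indicates a partition no better than the global background. The estimated k can then be passed to a standard algorithm such as K-Means or HAC, which is the usage pattern the abstract describes.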
