Clustering short text embeddings is a foundational task in natural language processing, yet it remains challenging because the number of clusters must be specified in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to scale efficiently to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. An implementation of our estimator of k and of the Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
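To make the two ideas in the abstract concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The function `estimate_k_eigengap` applies the standard eigengap heuristic to the normalized Laplacian of a dense cosine-similarity graph; the adaptive sampling strategy is omitted, and clipping negative similarities to zero is our own choice. The function `cohesion_ratio_sketch` is a hypothetical reading of the Cohesion Ratio as mean intra-cluster cosine similarity divided by mean similarity over all pairs; the abstract does not give the exact formula, so the paper's definition may differ. All names and parameters (`max_k`, the toy data) are illustrative.

```python
import numpy as np


def estimate_k_eigengap(X, max_k=20):
    """Eigengap heuristic on a cosine-similarity graph (illustrative only)."""
    # Unit-normalize rows so that dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: negative similarities are clipped so edge weights are >= 0.
    S = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(S, 0.0)  # no self-loops
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} S D^{-1/2}.
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)[: max_k + 1]  # ascending order
    gaps = np.diff(eigvals)
    # The largest gap after the i-th smallest eigenvalue suggests k = i.
    return int(np.argmax(gaps)) + 1


def cohesion_ratio_sketch(X, labels):
    """Hypothetical ratio: mean intra-cluster similarity / mean global similarity."""
    labels = np.asarray(labels)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    off_diag = ~np.eye(len(S), dtype=bool)
    same_cluster = (labels[:, None] == labels[None, :]) & off_diag
    # Note: undefined (NaN) if every cluster is a singleton, and not
    # meaningful if the global mean similarity is zero or negative.
    return S[same_cluster].mean() / S[off_diag].mean()


# Toy data: three tight groups around orthogonal directions in 10 dimensions.
rng = np.random.default_rng(0)
centers = np.eye(10)[:3]
X = np.vstack([c + 0.05 * rng.standard_normal((30, 10)) for c in centers])
labels = np.repeat(np.arange(3), 30)

k_hat = estimate_k_eigengap(X)
cr_true = cohesion_ratio_sketch(X, labels)
cr_shuffled = cohesion_ratio_sketch(X, rng.permutation(labels))
print(k_hat, cr_true, cr_shuffled)
```

On this toy data the ratio for the true grouping should sit well above the ratio for shuffled labels, which stays near 1 because a random partition has roughly the same similarity as the global background. The dense eigendecomposition costs O(n^3) time, which is why a sampling strategy of the kind the abstract mentions would be needed for large datasets.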