Text clustering is arguably one of the most important topics in modern data mining. However, text data must be tokenized, which usually yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as Latent Semantic Analysis have helped mitigate this issue, but they are not completely stable in practice. We therefore propose a new feature agglomeration method based on Nonnegative Matrix Factorization (NMF). NMF is used to separate the terms into groups, and each group's term vectors are then agglomerated into a new feature vector. Together, these feature vectors form a new feature space that is much better suited to clustering. In addition, we propose a new deterministic initialization for spherical K-Means, which proves very useful for this specific type of data. To evaluate the proposed method, we compare it with some of the latest research in this field, as well as with some of the most widely used methods. Our experiments show that the proposed method either significantly improves clustering performance or matches the performance of other methods while producing more stable results.
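To make the agglomeration step concrete, the following is a minimal sketch of one way it could be realized. It is not the exact procedure of the proposed method. The abstract does not specify how terms are assigned to groups or how a group's term vectors are combined, so the sketch assumes that each term joins the NMF factor on which it has the largest loading and that a group's term vectors are summed. The function names `nmf` and `agglomerate_terms`, the multiplicative-update solver, and the toy matrix are all illustrative choices. The matrix is stored with documents as rows and terms as columns.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (docs x terms) as W @ H with nonnegative factors.

    Uses the standard multiplicative updates for the Frobenius loss;
    this solver is an assumption, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def agglomerate_terms(X, k, **nmf_kwargs):
    """Group terms with NMF, then merge each group into one feature.

    Assumptions (not taken from the source): a term belongs to the factor
    with its largest loading in H, and a group's term vectors are summed.
    """
    _, H = nmf(X, k, **nmf_kwargs)
    groups = H.argmax(axis=0)  # one group index per term
    Z = np.zeros((X.shape[0], k))
    for g in range(k):
        # Sum the columns (term vectors) assigned to group g.
        Z[:, g] = X[:, groups == g].sum(axis=1)
    return Z, groups


# Toy document-term counts: 4 documents, 6 terms.
X = np.array(
    [
        [2, 1, 3, 0, 0, 0],
        [1, 2, 2, 0, 0, 0],
        [0, 0, 0, 3, 1, 2],
        [0, 0, 0, 1, 2, 2],
    ],
    dtype=float,
)

Z, groups = agglomerate_terms(X, k=2)
print("term groups:", groups)
print("agglomerated features:\n", Z)
```

The result `Z` has one column per term group instead of one per term. Before running spherical K-Means on it, each row would typically be scaled to unit L2 norm. The deterministic initialization mentioned above is not described in the abstract, so it is not sketched here.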