Text clustering is arguably one of the most important topics in modern data mining. However, text data must be tokenized, which typically yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as Latent Semantic Analysis help mitigate this issue, but are not entirely stable in practice. We therefore propose a new feature agglomeration method based on Nonnegative Matrix Factorization (NMF). NMF is employed to separate the terms into groups, and each group's term vectors are then agglomerated into a new feature vector. Together, these feature vectors create a new feature space that is much better suited to clustering. In addition, we propose a new deterministic initialization for spherical K-Means, which proves very useful for this specific type of data. To evaluate the proposed method, we compare it against some of the latest research in this field, as well as several of the most widely practiced methods. Our experiments show that the proposed method either significantly improves clustering performance or matches the performance of the other methods while improving the stability of the results.
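The grouping-and-agglomeration step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain multiplicative-update NMF, assigns each term to its dominant factor, and sums each group's term vectors; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    # Basic multiplicative-update NMF: X ~= W @ H with nonnegative factors.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def agglomerate_terms(X, k):
    # Assign each term (row of X) to the NMF factor with the largest
    # weight, then agglomerate each group's term vectors into a single
    # feature vector (hypothetical sketch of the step in the abstract).
    W, _ = nmf(X, k)
    labels = W.argmax(axis=1)           # term -> group assignment
    F = np.zeros((k, X.shape[1]))
    for g in range(k):
        F[g] = X[labels == g].sum(axis=0)
    return F                            # k dense features per document

# Toy term-document matrix: 6 terms x 4 documents.
X = np.array([[3., 0, 2, 0],
              [2., 0, 3, 0],
              [0., 2, 0, 3],
              [0., 3, 0, 2],
              [1., 1, 1, 1],
              [0., 0, 1, 2]])
F = agglomerate_terms(X, 2)
print(F.shape)  # (2, 4): far fewer, denser features than the 6 sparse terms
```

The resulting k-dimensional document representation is what a clustering algorithm such as spherical K-Means would then consume.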
A matrix algorithm runs at sublinear cost if it uses far fewer arithmetic operations than the input matrix has entries. Such algorithms are indispensable for Big Data mining and analysis, where input matrices are so immense that one can only access a small fraction of their entries. Typically, however, such matrices admit a Low Rank Approximation (LRA), which can be accessed and processed at sublinear cost. An adversary argument shows that no algorithm running at sublinear cost can output an accurate LRA of worst-case input matrices, but we prove that some sublinear cost algorithms output a reasonably close LRA of a matrix $M$ if it (i) is sufficiently close to a low rank matrix or (ii) is a Symmetric Positive Semidefinite (SPSD) matrix that admits an LRA. In both cases the supporting algorithms are deterministic and output an LRA in the special form of CUR LRA, which is memory efficient and built on a submatrix of $M$, called a CUR generator, having maximal volume among all submatrices of the same size. In case (i) we apply Cross-Approximation (C-A) iterations, which run at sublinear cost and have been computing accurate LRA in practice worldwide for more than a decade. We provide the first, albeit limited, formal support for this long-known empirical efficiency, under the assumption that the initial submatrix of at least one C-A iteration is non-degenerate. In case (ii) we make no additional assumptions about the input class of SPSD matrices admitting LRA or about the initialization of our sublinear cost algorithms for CUR LRA.
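The CUR construction described above can be sketched as follows. This is an illustrative assumption-laden toy, not the paper's algorithm: the row and column index sets are simply given (the paper's contribution is how to find a good, near-maximal-volume generator at sublinear cost), and the toy SPSD matrix is hypothetical. Note that building the approximation touches only $|I|$ rows and $|J|$ columns of $M$.

```python
import numpy as np

def cur_lra(M, I, J):
    # CUR LRA from row indices I and column indices J.
    # Only |I| rows and |J| columns of M are read (sublinear access).
    # The generator G = M[I, J] should be nonsingular, ideally of
    # (near-)maximal volume among submatrices of its size.
    C = M[:, J]                  # column submatrix
    R = M[I, :]                  # row submatrix
    G = M[np.ix_(I, J)]          # CUR generator
    return C @ np.linalg.pinv(G) @ R

# Toy SPSD matrix of rank 2: M = B B^T.
rng = np.random.default_rng(1)
B = rng.random((6, 2))
M = B @ B.T
M_hat = cur_lra(M, [0, 3], [0, 3])
print(np.linalg.norm(M - M_hat))  # tiny: CUR is exact when rank(G) = rank(M)
```

When the generator has the same rank as $M$, the CUR LRA reproduces $M$ exactly; for matrices merely close to low rank, the quality of the approximation depends on the volume of the chosen generator, which motivates the maximal-volume criterion.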