The clustering and visualisation of high-dimensional data is a ubiquitous task in modern data science. Popular techniques include nonlinear dimensionality reduction methods such as t-SNE and UMAP. These methods face the `scale-problem' of clustering: when dealing with the MNIST dataset, do we want to distinguish different digits, or different ways of writing the same digit? The answer depends on the task and on the scale. We revisit an idea of Robinson & Pierce-Hoffman that exploits an underlying scaling symmetry in t-SNE to replace 2-dimensional embeddings with (2+1)-dimensional embeddings, where the additional parameter accounts for scale. This gives rise to the t-SNE tree (short: tree-SNE). We prove that the optimal embedding depends continuously on the scaling parameter for all initial conditions outside a set of measure 0: the tree-SNE tree exists. The idea is illustrated on several examples and conceivably extends to other attraction-repulsion methods.