UMAP has supplanted t-SNE as state-of-the-art for visualizing high-dimensional datasets in many disciplines, but the reason for its success is not well understood. In this work, we investigate UMAP's sampling-based optimization scheme in detail. We derive UMAP's effective loss function in closed form and find that it differs from the published one. As a consequence, we show that UMAP does not aim to reproduce its theoretically motivated high-dimensional UMAP similarities. Instead, it tries to reproduce similarities that only encode the shared $k$-nearest-neighbor graph, thereby challenging the previous understanding of UMAP's effectiveness. We claim instead that the key to UMAP's success is its implicit balancing of attraction and repulsion resulting from negative sampling. This balancing in turn facilitates optimization via gradient descent. We corroborate our theoretical findings on toy and single-cell RNA sequencing data.
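To make the sampling-based optimization scheme referred to above concrete, the following is a minimal sketch (not the reference implementation) of one epoch of UMAP-style stochastic gradient descent: each sampled edge of the $k$-nearest-neighbor graph contributes an attractive update, and a handful of uniformly drawn negative samples contribute repulsive updates. The function names, default parameters `a`, `b`, `n_neg`, and the learning rate are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def umap_sgd_epoch(Y, edges, n_neg=5, lr=1.0, a=1.0, b=1.0, rng=None):
    """One epoch of UMAP-style sampling-based optimization (illustrative sketch).

    Y     : (n, d) low-dimensional embedding, updated in place
    edges : iterable of (i, j) index pairs from the kNN graph (assumed input)
    n_neg : number of negative samples drawn per positive edge
    a, b  : parameters of the low-dimensional similarity 1 / (1 + a * d^(2b))
    """
    rng = rng or np.random.default_rng()
    n = len(Y)
    for i, j in edges:
        # Attraction: pull y_i toward y_j along the sampled kNN edge.
        diff = Y[i] - Y[j]
        d2 = diff @ diff
        grad_coeff = -2.0 * a * b * d2 ** (b - 1.0) / (a * d2 ** b + 1.0)
        Y[i] += lr * grad_coeff * diff
        Y[j] -= lr * grad_coeff * diff

        # Repulsion: push y_i away from n_neg uniformly sampled points.
        for _ in range(n_neg):
            k = rng.integers(n)
            if k == i:
                continue
            diff = Y[i] - Y[k]
            d2 = diff @ diff
            grad_coeff = 2.0 * b / ((0.001 + d2) * (a * d2 ** b + 1.0))
            Y[i] += lr * grad_coeff * diff
    return Y
```

Because repulsion is applied only to the `n_neg` negative samples per positive edge rather than to all non-neighbors, the effective weighting of attraction versus repulsion differs from the published loss; this implicit rebalancing is what the abstract identifies as central to UMAP's success.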