In data science, determining proximity between observations is critical to many downstream analyses, such as clustering, classification, and prediction. However, when the underlying probability distribution of the data is unclear, the function used to compute similarity between data points is often chosen arbitrarily. Here, we present a novel definition of proximity, Semblance, that uses the empirical distribution of each feature across all observations to inform the similarity between each pair of observations. The advantage of Semblance lies in its distribution-free formulation and in its ability to place greater emphasis on proximity between observation pairs that fall at the outskirts of the data distribution than on pairs that fall towards the center. We prove that Semblance is a valid Mercer kernel, which allows its principled use in kernel-based learning algorithms. Semblance can be applied to any data modality, and we demonstrate consistent improvement over conventional methods through simulations and three real case studies from diverse applications: cell-type classification in single-cell transcriptomics, image reconstruction, and financial forecasting.
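To make the idea of a similarity informed by the empirical distribution concrete, the following is a minimal sketch under stated assumptions, not the definition used in the paper. The function name `empirical_similarity` and the per-feature rule (one minus the fraction of observations lying strictly between the two values) are hypothetical choices made for illustration only, and no claim is made that this sketch is positive semi-definite.

```python
import numpy as np


def empirical_similarity(X):
    """Hypothetical illustration, not the paper's formula.

    For each feature, the similarity of observations i and j is taken as
    1 minus the fraction of all observations whose value lies strictly
    between the two values. The result is averaged over features.
    """
    X = np.asarray(X, dtype=float)
    n_obs, n_feat = X.shape
    K = np.zeros((n_obs, n_obs))
    for f in range(n_feat):
        col = X[:, f]
        # Pairwise lower and upper endpoints of the interval spanned by each pair.
        lo = np.minimum.outer(col, col)
        hi = np.maximum.outer(col, col)
        # Fraction of observations strictly inside each pair's interval
        # (an n x n x n boolean array, so only suitable for small n).
        inside = (col[None, None, :] > lo[:, :, None]) & (
            col[None, None, :] < hi[:, :, None]
        )
        K += 1.0 - inside.mean(axis=2)
    return K / n_feat


# One feature with a sparse upper tail: values 0, 1, 2 and an outlying 10.
K = empirical_similarity([[0.0], [1.0], [2.0], [10.0]])
print(K)
```

In this toy example the pair (2, 10) receives the same similarity as the pair (0, 1), because no other observation falls between either pair, even though the raw distances differ by a factor of eight. This is one simple way in which a rank-based, distribution-free measure can treat pairs in sparse regions of the data differently from a fixed distance function.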