Given $M \geq 2$ distributions defined on a general measurable space, we introduce a nonparametric (kernel) measure of multi-sample dissimilarity (KMD) -- a parameter that quantifies the difference between the $M$ distributions. The population KMD, which takes values between 0 and 1, is 0 if and only if all $M$ distributions are the same, and 1 if and only if all the distributions are mutually singular. Moreover, KMD possesses many properties commonly associated with $f$-divergences, such as the data processing inequality and invariance under bijective transformations. The sample estimate of KMD, based on independent observations from the $M$ distributions, can be computed in near-linear time (up to logarithmic factors) using $k$-nearest-neighbor graphs (for fixed $k \ge 1$). We develop an easily implementable test for the equality of the $M$ distributions based on the sample KMD that is consistent against all alternatives where at least two distributions are not equal. We prove central limit theorems for the sample KMD, and provide a complete characterization of the asymptotic power of the test, as well as its detection threshold. The usefulness of our measure is demonstrated via real and synthetic data examples; our method is also implemented in an R package.
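As a rough illustration of the kind of graph-based computation described above, the following Python sketch computes a normalized $k$-nearest-neighbor dissimilarity score for $M$ samples. It is not the R package's implementation, and several choices are assumptions of this sketch rather than statements from the abstract: the function name `knn_dissimilarity` is hypothetical, the kernel on the sample labels is taken to be the discrete kernel (1 if two labels agree, 0 otherwise), the graph is the directed Euclidean $k$-NN graph, and the normalization shown is one plausible way to rescale the score so that identical distributions give values near 0 and well-separated ones give values near 1. The neighbor search here is brute force and quadratic in the pooled sample size; a near-linear-time version would need a tree-based or similar neighbor search.

```python
import numpy as np


def knn_dissimilarity(samples, k=1):
    """Sketch of a k-NN, label-agreement dissimilarity score (assumed form).

    samples: list of M arrays, each of shape (n_m, d).
    k: number of nearest neighbors per point (fixed, k >= 1).
    """
    pooled = np.vstack(samples)
    # Label each pooled point with the index of the sample it came from.
    labels = np.concatenate(
        [np.full(len(s), m) for m, s in enumerate(samples)]
    )
    n = len(pooled)

    # Brute-force squared Euclidean distances (O(n^2); illustration only).
    sq = np.sum(pooled ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * pooled @ pooled.T
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every point.
    nbrs = np.argsort(dist, axis=1)[:, :k]

    # Average fraction of neighbors that share the point's label.
    neighbor_agreement = np.mean(labels[nbrs] == labels[:, None])

    # Label agreement between two distinct points drawn at random.
    counts = np.bincount(labels)
    baseline_agreement = np.sum(counts * (counts - 1)) / (n * (n - 1))

    # Rescale: about 0 when neighbors agree no more often than chance,
    # exactly 1 when every neighbor comes from the same sample.
    return (neighbor_agreement - baseline_agreement) / (1.0 - baseline_agreement)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same_a = rng.normal(size=(200, 2))
    same_b = rng.normal(size=(200, 2))
    far_b = rng.normal(size=(200, 2)) + 100.0
    print("identical distributions:", knn_dissimilarity([same_a, same_b], k=3))
    print("well-separated distributions:", knn_dissimilarity([same_a, far_b], k=3))
```

Subtracting the chance-level agreement and dividing by its complement is what keeps the score comparable across unbalanced sample sizes; a permutation of the labels would be one simple (assumed, not source-confirmed) way to calibrate a test based on such a score.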