Wasserstein Archetypal Analysis

Archetypal analysis is an unsupervised machine learning method that summarizes data using a convex polytope. In its original formulation, for fixed k, the method finds a convex polytope with k vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared Euclidean distance between the data and the polytope is minimal. In the present work, we consider an alternative formulation of archetypal analysis based on the Wasserstein metric, which we call Wasserstein archetypal analysis (WAA). In one dimension, there exists a unique solution of WAA and, in two dimensions, we prove existence of a solution, as long as the data distribution is absolutely continuous with respect to Lebesgue measure. We discuss obstacles to extending our result to higher dimensions and general data distributions. We then introduce an appropriate regularization of the problem, via a Rényi entropy, which allows us to obtain existence of solutions of the regularized problem for general data distributions, in arbitrary dimensions. We prove a consistency result for the regularized problem, ensuring that if the data are i.i.d. samples from a probability measure, then as the number of samples is increased, a subsequence of the archetype points converges to the archetype points for the limiting data distribution, almost surely. Finally, we develop and implement a gradient-based computational approach for the two-dimensional problem, based on the semi-discrete formulation of the Wasserstein metric. Our analysis is supported by detailed computational experiments.
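To make the original (Euclidean) formulation concrete, the following minimal sketch evaluates its objective in two dimensions: the mean squared Euclidean distance between a set of data points and a convex polygon given by its vertices. It is an illustration under stated assumptions, not the computational approach of this work. The function names are hypothetical, the vertices are assumed to be listed in counter-clockwise order, and the sketch only evaluates the objective for a fixed polygon. It neither optimizes over the archetype points nor enforces that the polygon lies in the convex hull of the data.

```python
import numpy as np


def project_onto_segment(p, a, b):
    """Return the closest point to p on the segment from a to b."""
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:  # degenerate segment: a and b coincide
        return a.copy()
    t = np.clip(((p - a) @ ab) / denom, 0.0, 1.0)
    return a + t * ab


def inside_convex_polygon(p, vertices):
    """True if p lies in the convex polygon (vertices assumed counter-clockwise)."""
    k = len(vertices)
    for i in range(k):
        a, b = vertices[i], vertices[(i + 1) % k]
        # Negative cross product: p is strictly to the right of edge a -> b.
        cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        if cross < 0:
            return False
    return True


def mean_squared_distance(data, vertices):
    """Mean squared Euclidean distance from the data points to the polygon."""
    data = np.asarray(data, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    k = len(vertices)
    total = 0.0
    for p in data:
        if inside_convex_polygon(p, vertices):
            continue  # points inside the polygon contribute zero
        # Outside points: the nearest polygon point lies on one of the edges.
        total += min(
            float(np.sum((p - project_onto_segment(p, vertices[i], vertices[(i + 1) % k])) ** 2))
            for i in range(k)
        )
    return total / len(data)


# Hypothetical example: a triangle (k = 3) and four data points.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
points = [(0.25, 0.25), (2.0, 0.0), (-1.0, -1.0), (1.0, 1.0)]
objective = mean_squared_distance(points, triangle)
```

In this example the first point lies inside the triangle and contributes nothing, while the other three are projected onto the boundary. Archetypal analysis in its original form would then search over the k vertices to make this quantity as small as possible.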
