Clustering Mixtures with Almost Optimal Separation in Polynomial Time

We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of $k$ identity-covariance Gaussians such that the minimum distance between any pair of means is at least $\Delta$, for some parameter $\Delta > 0$, and the goal is to recover the ground-truth clustering of these samples. It is folklore that separation $\Delta = \Theta(\sqrt{\log k})$ is both necessary and sufficient to recover a good clustering, at least information-theoretically. However, the estimators that achieve this guarantee are computationally inefficient. We give the first algorithm that runs in polynomial time and almost matches this guarantee. More precisely, we give an algorithm that takes polynomially many samples and polynomial time, and that successfully recovers a good clustering so long as the separation is $\Delta = \Omega(\log^{1/2 + c} k)$, for any $c > 0$. Previously, polynomial-time algorithms were known for this problem only when the separation was polynomial in $k$, and all algorithms that could tolerate $\textsf{poly}(\log k)$ separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution that satisfies the Poincar\'{e} inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high-degree moments of a distribution, which allows us to extract important information about these moments without ever writing down the full moment tensors explicitly.
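To make the problem setup concrete, the following is a minimal sketch of the input model only, not of the algorithm described above. It assumes uniform mixing weights, places the means on scaled standard basis vectors (which needs dimension at least $k$), and uses an oracle that already knows the true means to show what the ground-truth clustering looks like. The function names, the constant `scale`, and the default `c` are all hypothetical choices made for illustration.

```python
import math
import random


def separation(k, c=0.1, scale=1.0):
    """Delta = scale * (log k)^(1/2 + c); scale and c are illustrative assumptions."""
    return scale * math.log(k) ** (0.5 + c)


def make_means(k, dim, delta):
    """Put mean i at r * e_i with r = delta / sqrt(2), so that every
    pairwise distance between means is exactly delta (assumes dim >= k)."""
    if dim < k:
        raise ValueError("this construction needs dim >= k")
    r = delta / math.sqrt(2.0)
    return [[r if j == i else 0.0 for j in range(dim)] for i in range(k)]


def sample_mixture(means, n, rng):
    """Draw n points from a uniform mixture of N(mu_i, I) (uniform weights assumed)."""
    points, labels = [], []
    for _ in range(n):
        i = rng.randrange(len(means))
        points.append([m + rng.gauss(0.0, 1.0) for m in means[i]])
        labels.append(i)
    return points, labels


def nearest_mean(point, means):
    """Index of the closest mean in squared Euclidean distance."""
    def sq_dist(i):
        return sum((p - m) ** 2 for p, m in zip(point, means[i]))
    return min(range(len(means)), key=sq_dist)


def oracle_accuracy(points, labels, means):
    """Fraction of points whose nearest TRUE mean matches their component.
    This oracle baseline is not a clustering algorithm: it is given the means."""
    hits = sum(nearest_mean(x, means) == y for x, y in zip(points, labels))
    return hits / len(points)


if __name__ == "__main__":
    rng = random.Random(0)
    k, dim = 8, 16
    delta = separation(k, c=0.1, scale=4.0)
    means = make_means(k, dim, delta)
    points, labels = sample_mixture(means, 500, rng)
    print("delta:", delta)
    print("oracle accuracy:", oracle_accuracy(points, labels, means))
```

Because the oracle is handed the means, it only illustrates how the separation $\Delta$ controls the overlap between components; the difficulty addressed in the abstract is recovering the clustering from the samples alone.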
