The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the $k$-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following $k-1$ centers is then sampled with probability proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees of the following \emph{greedy} variant: in every step, we sample $\ell$ candidate centers instead of one and then pick the candidate that minimizes the new cost. This is also how $k$-means++ is implemented in, e.g., the popular Scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for the greedy $k$-means++: We prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the other hand, we prove a lower bound of $\Omega(\ell^3 \log^3 k / \log^2(\ell\log k))$. Previously, only an $\Omega(\ell \log k)$ lower bound was known [Bhattacharya, Eube, R\"oglin, Schmidt; ESA 2020], and there was no known upper bound.
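The greedy seeding procedure described above can be sketched as follows. This is a minimal illustrative Python implementation under our own naming conventions, not the authors' or Scikit-learn's reference code; for $\ell = 1$ it reduces to plain $k$-means++ seeding.

```python
import random

def greedy_kmeanspp_seeding(points, k, ell, seed=0):
    """Sketch of greedy k-means++ seeding: the first center is uniform at
    random; each later step draws `ell` candidates with probability
    proportional to the current squared distance to the closest center,
    then greedily keeps the candidate minimizing the resulting cost."""
    rng = random.Random(seed)

    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [rng.choice(points)]  # first center: uniform at random
    # cost[i] = squared distance of points[i] to its closest chosen center
    cost = [d2(p, centers[0]) for p in points]

    for _ in range(k - 1):
        # sample ell candidate centers, each proportional to current cost
        candidates = rng.choices(points, weights=cost, k=ell)
        # pick the candidate that minimizes the total new cost
        best = min(
            candidates,
            key=lambda c: sum(min(ci, d2(p, c)) for p, ci in zip(points, cost)),
        )
        centers.append(best)
        cost = [min(ci, d2(p, best)) for p, ci in zip(points, cost)]
    return centers
```

Lloyd's algorithm would then be run from the returned centers, as in the standard pipeline.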