An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning

Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
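To make the idea of sketch-aligned sparsity concrete, the following is a minimal single-process simulation, not the paper's implementation. The abstract does not specify how the sketch is built, so everything here is an assumption made for illustration: gradients are treated as $m \times n$ matrices, the sketch is a Gaussian random projection generated from a shared seed, rows are scored by the norm of the aggregated sketch, and All-Reduce is imitated by a plain average. The function name `arc_topk_round` and its parameters are hypothetical.

```python
import numpy as np

def arc_topk_round(local_grads, k, sketch_dim, seed=0):
    """Simulate one round of a sketch-aligned Top-K compressor (illustrative only).

    local_grads: list of (m, n) arrays, one per simulated node.
    Returns the dense (m, n) averaged sparse gradient and the shared row indices.
    """
    m, n = local_grads[0].shape

    # Assumption: every node derives the same projection from a shared seed,
    # so the projection matrix itself is never communicated.
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, sketch_dim)) / np.sqrt(sketch_dim)

    # Step 1: each node sketches its gradient into an (m, sketch_dim) matrix.
    sketches = [g @ omega for g in local_grads]

    # Step 2: "All-Reduce" the small sketches (simulated by averaging).
    global_sketch = np.mean(sketches, axis=0)

    # Step 3: all nodes see the same aggregated sketch, so they all select
    # the same k rows without exchanging any indices.
    row_scores = np.linalg.norm(global_sketch, axis=1)
    rows = np.sort(np.argpartition(row_scores, -k)[-k:])

    # Step 4: "All-Reduce" only the selected rows. Because the sparsity
    # pattern is identical on every node, a plain sum/average suffices.
    reduced_rows = np.mean([g[rows] for g in local_grads], axis=0)

    out = np.zeros((m, n))
    out[rows] = reduced_rows
    return out, rows
```

In this sketch, plain Top-$K$ would instead let each node pick its own indices, which is why it needs All-Gather: the supports differ across nodes and cannot be summed element-wise. Selecting from a shared aggregate removes that mismatch, at the cost of one extra small All-Reduce for the sketch.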
