Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$\ discards structural information and performs poorly in practice, while Top-$K$\ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$\ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and a sharper convergence rate than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$\ matches the accuracy of Top-$K$\ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$\ with the strong performance of Top-$K$.
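The abstract does not spell out implementation details; the following minimal NumPy sketch is only one way to read the idea of sketch-aligned, index-free Top-$K$ aggregation. All simulated workers derive a common index set from a shared lightweight sketch (here taken, as an assumption, to be the coordinate-wise mean gradient magnitude, which in practice could be obtained with one small All-Reduce), so the subsequent value exchange needs no index metadata. The function names and the particular choice of sketch are illustrative and not the paper's exact algorithm.

\begin{verbatim}
import numpy as np

def shared_topk_mask(sketch, k):
    """Select the k coordinates with the largest sketch magnitude.

    Every worker evaluates this on the *same* sketch, so all workers
    agree on the index set without exchanging indices.
    """
    idx = np.argpartition(np.abs(sketch), -k)[-k:]
    mask = np.zeros(sketch.shape, dtype=bool)
    mask[idx] = True
    return mask

def arc_topk_allreduce(grads, k):
    """Illustrative sketch-aligned Top-k aggregation over simulated workers.

    `grads` is a list of per-worker gradient vectors. The "sketch" here is
    the coordinate-wise mean magnitude (an assumption for illustration);
    in a real deployment it would itself be shared via one cheap All-Reduce.
    """
    sketch = np.mean([np.abs(g) for g in grads], axis=0)
    mask = shared_topk_mask(sketch, k)
    # Every worker contributes values at the same k positions, so summing
    # the aligned sparse buffers is a plain All-Reduce -- no index exchange.
    compressed = [np.where(mask, g, 0.0) for g in grads]
    return np.mean(compressed, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=1000) for _ in range(8)]  # 8 simulated workers
    avg = arc_topk_allreduce(grads, k=100)
    print("nonzeros in aggregated gradient:", np.count_nonzero(avg))
\end{verbatim}

Because the retained coordinates are identical across workers, the aggregated vector is exactly $k$-sparse, in contrast to per-worker Top-$K$, whose union of supports grows with the number of workers and forces an All-Gather of indices.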