Distributed attention is a fundamental problem in scaling the context window of Large Language Models (LLMs). The state-of-the-art method, Ring-Attention, suffers from scalability limitations due to its excessive communication traffic. This paper proposes a new distributed attention algorithm, Mesh-Attention, by rethinking the design space of distributed attention with a new matrix-based model. Our method assigns a two-dimensional tile of computation blocks, rather than a one-dimensional row or column, to each GPU, achieving higher efficiency through a lower communication-computation (CommCom) ratio. This general approach subsumes Ring-Attention as a special case and allows the CommCom ratio to be tuned through different tile shapes. Importantly, we propose a greedy algorithm that efficiently searches the scheduling space within a tile, subject to restrictions that ensure efficient communication among GPUs. Theoretical analysis shows that Mesh-Attention achieves much lower communication complexity and better scalability than existing algorithms. Our extensive experimental results show that Mesh-Attention achieves up to 3.4x speedup (2.9x on average) and reduces communication volume by up to 85.4% (79.0% on average) on 256 GPUs. Our scalability results further demonstrate that Mesh-Attention sustains its advantage as the system scales, substantially reducing overhead in large-scale deployments. These results confirm the advantage of Mesh-Attention.
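To make the intuition behind the 2D tile assignment concrete, the following back-of-the-envelope sketch is our own illustration under standard assumptions (an $N \times N$ grid of query-key attention blocks distributed over $P$ GPUs), not a derivation taken from the paper; it compares the CommCom ratio of a square tile with that of a one-dimensional row assignment such as Ring-Attention's.

% Illustrative assumption: each GPU owns an a x b tile of blocks with ab = N^2/P,
% and must hold or receive a query blocks plus b key/value blocks.
\begin{align*}
  \text{2D tile } (a \times b),\ ab = \tfrac{N^2}{P}: &\quad
    \mathrm{CommCom}(a,b) \sim \frac{a+b}{ab}
    \ \ge\ \frac{2\sqrt{P}}{N}
    \quad \text{(equality at } a = b = \tfrac{N}{\sqrt{P}}\text{)}, \\
  \text{1D row (Ring-Attention-like), } a = \tfrac{N}{P},\ b = N: &\quad
    \mathrm{CommCom} \sim \frac{\tfrac{N}{P} + N}{\tfrac{N^2}{P}}
    = \frac{P+1}{N} \approx \frac{P}{N} \quad (P \gg 1).
\end{align*}
% Under these assumptions a square tile moves roughly a factor of sqrt(P)/2 less
% data per unit of computation than a row assignment, which is the surface-to-volume
% argument suggested by the abstract.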