The scale of transformer model pre-training is constrained by rapidly growing computation and communication costs. Low-rank bottleneck architectures offer a promising way to significantly reduce training time and memory footprint with minimal impact on accuracy. Despite their algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism: naively applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored to large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism and combines it with optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures show that BOOST achieves a 1.46-1.91$\times$ speedup over full-rank model baselines and a 1.87-2.27$\times$ speedup over low-rank models with naively integrated 3D parallelism, while improving GPU utilization and reducing communication overhead.
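To make the efficiency argument concrete, the sketch below shows a generic low-rank bottleneck replacement for a dense linear layer; the class name `LowRankLinear`, the chosen dimensions, and the rank are illustrative assumptions for this sketch, not BOOST's actual architecture or implementation.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replace a full-rank d_in x d_out weight with two rank-r factors."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # project d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # project r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two narrow matmuls instead of one wide one.
        return self.up(self.down(x))

# Parameter (and FLOP) count: d_in*r + r*d_out vs. d_in*d_out for full rank.
# Example with hypothetical sizes d_in = d_out = 4096 and r = 512:
full = 4096 * 4096                 # ~16.8M parameters
low = 4096 * 512 + 512 * 4096      # ~4.2M parameters
print(f"full-rank params: {full:,}, low-rank params: {low:,} "
      f"({low / full:.2%} of full)")
```

The factorization is what shrinks per-layer compute and memory, but it is also why the sharding patterns of standard tensor parallelism, which assume a single wide weight matrix, interact poorly with the narrow rank-$r$ dimension.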