Effective communication is essential in distributed training, and predictability is one of its most significant characteristics. However, existing studies primarily exploit this predictability through online profiling for runtime optimization, without building a systematic understanding of it. In this work, we systematically formulate communication predictability in distributed training, particularly for Large Language Models (LLMs) trained with hybrid parallelism. Our analysis covers both traffic patterns and communication overhead. Specifically, we investigate the predictable traffic patterns of typical LLMs and evaluate how various factors influence GPU utilization and effective bandwidth, the two critical variables that determine communication overhead. We further develop an analytical formulation to estimate communication overhead in LLM training and validate it against empirical data with high accuracy. Leveraging this formulation, we propose a configuration tuning tool, ConfigTuner, to optimize training performance. Compared with Megatron-LM, the training configurations optimized by ConfigTuner achieve up to a 1.36$\times$ increase in throughput. Compared with Alpa, ConfigTuner produces the same configuration suggestion while significantly reducing search complexity.
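To make the configuration-tuning idea concrete, the following is a minimal sketch of a search over hybrid-parallelism degrees driven by an estimated communication/compute cost. It is illustrative only: the function `estimated_step_time`, its coefficients, and the exhaustive enumeration are assumptions for this sketch, not the paper's analytical formulation or ConfigTuner's actual implementation.

```python
# Illustrative sketch only: brute-force search over (dp, tp, pp) degrees using a
# placeholder cost model. Names and coefficients are hypothetical, not the
# paper's formulation.
from itertools import product


def estimated_step_time(dp: int, tp: int, pp: int) -> float:
    """Placeholder analytical cost model (assumed): ideal compute scaling plus a
    communication term that grows with each parallelism degree. A real model
    would be fit to measured effective bandwidth and GPU utilization."""
    compute = 1.0 / (dp * tp * pp)               # ideal speedup from splitting work
    comm = 0.02 * tp + 0.01 * pp + 0.005 * dp    # assumed communication overhead
    return compute + comm


def search_configs(num_gpus: int):
    """Enumerate (dp, tp, pp) factorizations of num_gpus and return the
    configuration with the lowest estimated step time."""
    best = None
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp != num_gpus:
            continue
        t = estimated_step_time(dp, tp, pp)
        if best is None or t < best[1]:
            best = ((dp, tp, pp), t)
    return best


if __name__ == "__main__":
    config, t = search_configs(64)
    print(f"best (dp, tp, pp) = {config}, estimated step time = {t:.4f}")
```

In a real tool, the analytical overhead formulation described above would replace the placeholder cost function, and the enumeration could be pruned (e.g., restricting degrees to hardware-topology-friendly values) to keep the search tractable.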