Global communication, such as all-reduce and all-gather, is a prominent performance bottleneck in large language model (LLM) pretraining. To address this issue, we present Pier, an efficient and scalable optimizer with relaxed global communication. Pier builds upon DiLoCo, which pairs an inner optimizer run within groups of processors with an outer optimizer that requires global communication. To preserve convergence and model performance, Pier incorporates two key techniques for the outer optimizer: momentum warmup and momentum decay. Pier also employs an efficient and scalable system architecture that supports complex parallelization strategies in LLM pretraining. We evaluate the model performance and runtime reduction of Pier on the GPT model family (small, medium, XL, and 7B) and the OpenWebText dataset with a suite of thirteen downstream tasks. Under data parallelism, Pier speeds up GPT-2 XL training by 2.7x-3.7x on 256 NVIDIA A100 GPUs and by 1.2x-1.9x on 64 GH200 Superchips, without degrading validation loss or downstream task performance. Combining data and tensor parallelism, Pier reduces the training time of the GPT-2 7B model by 54.5% on 128 A100s.
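
Since the abstract only names the outer-optimizer techniques, the following is a minimal sketch of how a DiLoCo-style outer step with a scheduled outer-momentum coefficient could look. The schedule shapes (linear warmup followed by cosine decay), function names, and hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: DiLoCo-style outer step with an assumed momentum schedule.
# Not the authors' code; schedule form and hyperparameters are placeholders.
import math
import torch
import torch.distributed as dist


def outer_momentum(step, total_steps, peak=0.9, warmup_steps=100):
    """Assumed schedule: linear warmup to `peak`, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


@torch.no_grad()
def outer_step(global_params, local_params, momentum_buf, step, total_steps, outer_lr=0.7):
    """One outer update: average the pseudo-gradient across groups (the only
    global communication), then apply Nesterov SGD with the scheduled momentum."""
    mu = outer_momentum(step, total_steps)
    for g, p, m in zip(global_params, local_params, momentum_buf):
        delta = g - p                                  # pseudo-gradient from the inner phase
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)   # global all-reduce across groups
        delta.div_(dist.get_world_size())
        m.mul_(mu).add_(delta)                         # momentum accumulation
        g.sub_(outer_lr * (delta + mu * m))            # Nesterov-style outer update
        p.copy_(g)                                     # re-sync the local replica
```

In this sketch, the inner optimizer (e.g., AdamW) updates `local_params` for many steps without any cross-group traffic, so the all-reduce above is amortized over an entire inner phase rather than paid every step.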