SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next layer. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially, each through all layers), GreedySnake achieves higher training throughput at smaller batch sizes, bringing the system much closer to the ideal predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimizer step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake improves saturated training throughput over ZeRO-Infinity by 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and by 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake.
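The loop-order difference between the two scheduling strategies can be illustrated with a minimal sketch; the `layers`, `micro_batches`, and `forward_layer` names below are hypothetical placeholders, not GreedySnake's actual implementation, and only the forward pass is shown:

```python
# Illustrative only: contrasts the two loop orders for the forward pass.
# All identifiers here are hypothetical stand-ins for illustration.

def horizontal_schedule(layers, micro_batches, forward_layer):
    # Horizontal: each micro-batch traverses all layers before the next one starts,
    # so every layer's offloaded weights must be fetched once per micro-batch.
    for mb in micro_batches:
        x = mb
        for layer in layers:
            x = forward_layer(layer, x)

def vertical_schedule(layers, micro_batches, forward_layer):
    # Vertical: all micro-batches run through one layer before moving to the next,
    # so each layer's offloaded weights are fetched only once per accumulation step.
    acts = list(micro_batches)
    for layer in layers:
        acts = [forward_layer(layer, x) for x in acts]
```

Under this sketch's assumptions, vertical scheduling amortizes each layer's SSD traffic across all micro-batches of an accumulation step, which is the intuition behind reaching higher throughput at smaller batch sizes.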