Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality, as determined by a chosen quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show that this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Our findings, validated on 1.5B-parameter models trained on 30B tokens with various data-quality metrics, call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
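To make the two strategies concrete, below is a minimal sketch of (1) a moderate LR decay schedule whose floor is a fixed fraction of the peak LR and (2) a weighted average of the final few checkpoints. The function names, the cosine form of the schedule, and the `final_frac` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch


def moderate_decay_lr(step, total_steps, peak_lr, final_frac=0.5):
    """Cosine-shaped schedule whose final LR stays at final_frac * peak_lr.

    final_frac is an assumed ratio (e.g. 0.5), in contrast to standard
    schedules that decay the LR to (near) zero.
    """
    final_lr = final_frac * peak_lr
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final_lr + (peak_lr - final_lr) * cos


def average_checkpoints(state_dicts, weights):
    """Weighted average of the parameters of the last few checkpoints."""
    total = sum(weights)
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in state_dicts[0].items()}
    for sd, w in zip(state_dicts, weights):
        for k, v in sd.items():
            avg[k] += (w / total) * v.float()
    return avg
```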