Modern language model (LM) training is divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, considering both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance of mitigating forgetting during domain-specific continued pre-training and effective practices for doing so, the crucial role of continued pre-training in bridging the pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.