Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target-task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, allows it to keep prior knowledge intact while learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting with approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
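As a pointer to the underlying mathematics, the following is the textbook forward- versus reverse-KL contrast that the terms "mass-covering" and "mode-seeking" usually refer to; the notation ($q$ for a target distribution, $\pi_\theta$ for the model) is introduced here only for illustration, and the paper's own simplified two-mixture analysis may be set up differently. SFT via maximum likelihood on samples from $q$ minimizes the forward KL divergence,
\[
\mathbb{E}_{x \sim q}\!\left[-\log \pi_\theta(x)\right] \;=\; \mathrm{KL}\!\left(q \,\|\, \pi_\theta\right) + \mathrm{const},
\]
which is mass-covering: the model is penalized heavily wherever $q$ has mass that $\pi_\theta$ fails to cover. By contrast, the reverse KL divergence,
\[
\mathrm{KL}\!\left(\pi_\theta \,\|\, q\right) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \pi_\theta(x) - \log q(x)\right],
\]
is evaluated on samples drawn from the model itself (i.e., on-policy data) and is mode-seeking: $\pi_\theta$ incurs no penalty for placing little mass on parts of $q$, so a constrained model can concentrate on one mode of a multimodal target rather than spreading mass across all of it. Objectives of this reverse-KL form arise, for example, in entropy- or KL-regularized policy optimization, which is one common way to make precise the intuition that on-policy RL is mode-seeking.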