The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite their wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators into the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings empirically by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct}, and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) estimator configurations with unbiased gradients lead to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance of different KL configurations in off-policy settings and observe that KL regularization can help stabilize the off-policy RL training that arises in asynchronous setups.
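For concreteness, the following is an illustrative sketch in standard notation (not notation taken from this abstract) of the KL-regularized objective and the per-sample estimators commonly used in practice, where $\pi_\theta$ denotes the trained policy, $\pi_{\mathrm{ref}}$ the reference policy, $\beta$ the regularization coefficient, and $r$ the per-sample probability ratio:
\[
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ R(x, y) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right),
\qquad
r = \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_\theta(y \mid x)},
\]
with the widely used per-sample estimators
\[
k_1 = -\log r, \qquad k_2 = \tfrac{1}{2}\left(\log r\right)^2, \qquad k_3 = r - 1 - \log r .
\]
Under samples drawn from $\pi_\theta$, $k_1$ and $k_3$ are unbiased estimates of the reverse KL value, while $k_2$ trades a small bias for lower variance; how an estimator enters the loss (e.g., folded into the reward versus differentiated as a loss term) is the kind of design choice that determines whether the resulting gradient matches the stated objective.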