Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) with policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (or baseline) that reduces the variance of the policy-gradient estimator. In practice, each per-prompt mean is estimated from the empirical average of the generations sampled for that prompt within a batch. Drawing inspiration from Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. The proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, yielding lower-variance gradient updates and improved training stability.
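To make the idea concrete, the sketch below shows one way a shrinkage baseline could slot into GRPO-style reward centering: per-prompt empirical means are shrunk toward the across-prompt (grand) mean before being subtracted from trajectory rewards. The function names and the positive-part James-Stein shrinkage coefficient are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def shrinkage_baselines(rewards: np.ndarray) -> np.ndarray:
    """Illustrative James-Stein-style shrinkage of per-prompt mean rewards
    toward the across-prompt (grand) mean.

    rewards: array of shape (num_prompts, num_generations) holding the
    verifiable reward of each sampled trajectory.

    Note: the shrinkage coefficient below is a standard positive-part
    James-Stein choice used here for illustration only; the paper's exact
    estimator may differ.
    """
    m, n = rewards.shape
    prompt_means = rewards.mean(axis=1)        # per-prompt empirical means
    grand_mean = prompt_means.mean()           # across-prompt (grand) mean
    # Pooled variance of a per-prompt mean estimate (sigma^2 / n).
    sigma2 = rewards.var(axis=1, ddof=1).mean() / n
    spread = np.sum((prompt_means - grand_mean) ** 2)
    # Positive-part James-Stein shrinkage factor (assumed form).
    shrink = 1.0
    if m > 3:
        shrink = max(0.0, 1.0 - (m - 3) * sigma2 / max(spread, 1e-12))
    return grand_mean + shrink * (prompt_means - grand_mean)

def centered_rewards(rewards: np.ndarray) -> np.ndarray:
    """Drop-in replacement for per-prompt mean centering in GRPO-style updates."""
    baselines = shrinkage_baselines(rewards)   # shape (num_prompts,)
    return rewards - baselines[:, None]        # broadcast over generations
```

With few generations per prompt, the per-prompt means are noisy and the factor shrinks them strongly toward the grand mean; with many generations or widely separated prompt means, the factor approaches 1 and the baseline reduces to the standard per-prompt empirical mean.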