ReDit: Reward Dithering for Improved LLM Policy Optimization

DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While this is a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. The perturbed reward provides exploratory gradients throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and it still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm that ReDit significantly mitigates these gradient issues. Moreover, theoretical analyses are provided to further validate these advantages.
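
The core idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dither_rewards` and `group_advantages`, the choice of zero-mean Gaussian noise, the value `sigma=0.05`, and the use of the population standard deviation are all assumptions made for illustration. The sketch shows why a group of identical discrete rewards yields no signal under group-standardized advantages (as used in GRPO), and how adding small random noise changes that.

```python
import random
import statistics


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each reward (assumed noise model)."""
    rng = rng or random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)               # fixed seed for reproducibility
flat_group = [0.0, 0.0, 0.0, 0.0]    # every sampled answer scored 0

# Identical rewards: every advantage is exactly zero, so no gradient signal.
print(group_advantages(flat_group))

# Dithered rewards: advantages are no longer all zero.
print(group_advantages(dither_rewards(flat_group, sigma=0.05, rng=rng)))
```

In this toy setting the noise scale `sigma` is a free parameter; how it should be chosen or scheduled is not specified by the abstract.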
