Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative, gamma-discounted future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by bounding the range of the value function and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches with min-form credit assignment achieve reasoning performance comparable to verifiable-reward-based methods within only 30% of the training steps. In contrast, the canonical sum-form credit assignment collapses training even at the very beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and obtain the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
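To make the contrast concrete, one plausible formalization of the two credit-assignment schemes described above is sketched here; the notation $s_t$, $r_k$, and $\gamma$ is illustrative and not taken from the paper's formal definitions:
\[
V_{\text{sum}}(s_t) \;=\; \mathbb{E}\Big[\sum_{k \ge t} \gamma^{\,k-t}\, r_k\Big]
\qquad\text{vs.}\qquad
V_{\text{min}}(s_t) \;=\; \mathbb{E}\Big[\min_{k \ge t}\, r_k\Big],
\]
where $r_k$ denotes the process reward assigned to step $k$. Under the sum-form definition, a single step with an inflated process reward raises the value of every preceding step, whereas the min-form definition bounds the value by the worst future step, which is consistent with the restricted value range and more reasonable advantage distribution claimed above.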