Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative, gamma-discounted future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by bounding the range of the value function and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches with min-form credit assignment achieve reasoning performance comparable to verifiable-reward-based methods within only 30% of the training steps. In contrast, the canonical sum-form credit assignment collapses training even at the very beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and obtain the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
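To make the contrast concrete, one plausible formalization of the two credit-assignment schemes described above is sketched here; the notation $s_t$, $r_k$, and $\gamma$ is illustrative and not taken from the paper's formal definitions:
\[
V_{\text{sum}}(s_t) \;=\; \mathbb{E}\Big[\sum_{k \ge t} \gamma^{\,k-t}\, r_k\Big]
\qquad\text{vs.}\qquad
V_{\text{min}}(s_t) \;=\; \mathbb{E}\Big[\min_{k \ge t}\, r_k\Big],
\]
where $r_k$ denotes the process reward assigned to step $k$. Under the sum-form definition, a single step with an inflated process reward raises the value of every preceding step, whereas the min-form definition bounds the value by the worst future step, which is consistent with the restricted value range and more reasonable advantage distribution claimed above.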