Large Language Models (LLMs) have achieved remarkable progress on complex reasoning tasks through both post-training and test-time scaling. Prevalent test-time scaling approaches typically rely on external reward models to guide the generation process, yet we find that this yields only marginal gains for a model post-trained on specific reasoning tasks. We attribute this limited improvement to distribution discrepancies between the task-specific post-trained generator and the general-purpose reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its own generations, without relying on external verifiers. We train our self-verification models on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across reasoning contexts of varying lengths. Experiments on multiple mathematical reasoning benchmarks show that our models not only improve post-training performance but also enable effective test-time scaling.
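To make the inference-time procedure concrete, below is a minimal sketch of self-verification-based test-time scaling, assuming a single model that can be prompted in both a generation role and a verification role; `generate_solution` and `verify_solution` are hypothetical placeholders for those two prompting modes, not the paper's actual API.

```python
# Illustrative sketch (not the paper's implementation): test-time scaling by
# self-verification. The same RL-trained model is assumed to back both callables,
# once as the answer generator and once as the verifier of its own answers.
from typing import Callable, List, Tuple


def self_verified_answer(
    question: str,
    generate_solution: Callable[[str], str],          # hypothetical: sample one solution
    verify_solution: Callable[[str, str], float],     # hypothetical: score a (question, solution) pair
    num_samples: int = 8,
) -> str:
    """Sample several candidate solutions, score each with the model's own
    verification pass, and return the candidate judged most likely correct."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(num_samples):
        solution = generate_solution(question)        # model acts as generator
        score = verify_solution(question, solution)   # same model acts as verifier
        candidates.append((solution, score))
    # Select the self-verified best candidate; no external reward model is involved.
    best_solution, _ = max(candidates, key=lambda pair: pair[1])
    return best_solution
```

Because both callables are served by the same post-trained model, the selection loop avoids the generator/reward-model distribution mismatch that limits scaling with an external verifier.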