Large Language Models (LLMs) have achieved remarkable progress on complex reasoning tasks through both post-training and test-time scaling. Prevalent test-time scaling approaches typically rely on external reward models to guide the generation process, yet we find that this yields only marginal gains for a model post-trained on specific reasoning tasks. We attribute this limited improvement to distribution discrepancies between the task-specific post-trained generator and the general-purpose reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its own generations, without relying on external verifiers. We train our self-verification models on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across reasoning contexts of varying lengths. Experiments on multiple mathematical reasoning benchmarks show that our models not only improve post-training performance but also enable effective test-time scaling.
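To make the inference-time procedure concrete, below is a minimal sketch of self-verification-based test-time scaling, assuming a single model that can be prompted in both a generation role and a verification role; `generate_solution` and `verify_solution` are hypothetical placeholders for those two prompting modes, not the paper's actual API.

```python
# Illustrative sketch (not the paper's implementation): test-time scaling by
# self-verification. The same RL-trained model is assumed to back both callables,
# once as the answer generator and once as the verifier of its own answers.
from typing import Callable, List, Tuple


def self_verified_answer(
    question: str,
    generate_solution: Callable[[str], str],          # hypothetical: sample one solution
    verify_solution: Callable[[str, str], float],     # hypothetical: score a (question, solution) pair
    num_samples: int = 8,
) -> str:
    """Sample several candidate solutions, score each with the model's own
    verification pass, and return the candidate judged most likely correct."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(num_samples):
        solution = generate_solution(question)        # model acts as generator
        score = verify_solution(question, solution)   # same model acts as verifier
        candidates.append((solution, score))
    # Select the self-verified best candidate; no external reward model is involved.
    best_solution, _ = max(candidates, key=lambda pair: pair[1])
    return best_solution
```

Because both callables are served by the same post-trained model, the selection loop avoids the generator/reward-model distribution mismatch that limits scaling with an external verifier.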