While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) with verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate the advantages derived from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization on the out-of-distribution MMLU-Pro benchmark.
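To make the advantage shaping described above more concrete, the sketch below illustrates one plausible reading of it in Python: group-relative advantages computed from verifiable outcome rewards are augmented by a gated, difficulty-weighted advantage derived from the reasoning reward, with the extra term clipped for stability. The function name, the specific gating and weighting formulas, and the assumed reward range are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def awpo_advantages(outcome_rewards, reasoning_rewards, clip_range=1.0, eps=1e-8):
    """Illustrative sketch of AWPO-style advantage shaping for one prompt group.

    Hypothetical function and parameter names; the paper's exact method may differ.
    outcome_rewards, reasoning_rewards: arrays of shape (group_size,) with
    per-rollout rewards for the same prompt (group-relative, GRPO-style setup).
    """
    outcome_rewards = np.asarray(outcome_rewards, dtype=float)
    reasoning_rewards = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative advantages from the verifiable outcome reward.
    outcome_adv = (outcome_rewards - outcome_rewards.mean()) / (outcome_rewards.std() + eps)

    # Group-relative advantages from the explicit reasoning reward.
    reasoning_adv = (reasoning_rewards - reasoning_rewards.mean()) / (reasoning_rewards.std() + eps)

    # Variance-aware gating (assumption): rely less on the reasoning signal when
    # outcome rewards already separate the group (high variance), and more when
    # outcomes are uninformative (low variance).
    gate = 1.0 / (1.0 + outcome_rewards.var())

    # Difficulty-aware weighting (assumption): use the group's mean outcome
    # reward as a difficulty proxy, so harder prompts lean more on reasoning.
    difficulty = 1.0 - outcome_rewards.mean()  # assumes outcome rewards in [0, 1]

    # Clip the reasoning contribution before adding it, for stable optimization.
    reasoning_term = np.clip(gate * difficulty * reasoning_adv, -clip_range, clip_range)
    return outcome_adv + reasoning_term
```

Under this reading, the outcome-reward advantage remains the primary learning signal, and the reasoning-reward term only modulates it within a bounded range, which is one way the combination could avoid conflicting with the main optimization objective.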