Execution-based feedback such as unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that all succeed or that all fail. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. However, in aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents in both TTS and RL. For example, with TTS it raises the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and of Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified, achieving new state-of-the-art performance among open-source models.
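To make the three evaluation axes concrete, the following minimal Python sketch (our illustration, not the paper's released code) shows how a reward model's scores over candidate trajectories would be used for best-of-N selection in TTS, and how classification accuracy and calibration, measured here via expected calibration error, can be computed from those scores against ground-truth resolution labels. The function names, the 0.5 decision threshold, and the binning scheme are assumptions made for illustration.

```python
import numpy as np

def best_of_n(scores):
    """TTS selection: return the index of the trajectory with the highest reward-model score."""
    return int(np.argmax(scores))

def classification_accuracy(scores, labels, threshold=0.5):
    """Fraction of trajectories whose success (1) or failure (0) the verifier predicts correctly,
    thresholding scores assumed to lie in [0, 1]."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

def expected_calibration_error(scores, labels, n_bins=10):
    """ECE: weighted average gap between mean predicted score and empirical success rate per bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the right edge only in the final bin so that a score of 1.0 is counted.
        mask = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return float(ece)

# Toy usage: four candidate trajectories for one issue, with verifier scores and true outcomes.
scores = [0.82, 0.35, 0.91, 0.10]
labels = [1, 0, 1, 0]
print(best_of_n(scores))                       # index of the trajectory TTS would submit
print(classification_accuracy(scores, labels))  # accuracy of the verifier as a classifier
print(expected_calibration_error(scores, labels))
```

Two verifiers can rank trajectories identically (same `best_of_n` choices, hence the same TTS result) while differing sharply in accuracy and ECE, which is the sense in which strong TTS performance need not transfer to RL reward signals.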