Execution-based feedback such as unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that all succeed or that all fail. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. However, in aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents in both TTS and RL. For example, with TTS it raises the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and of Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified, achieving new state-of-the-art performance among open-source models.
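To make the three evaluation axes concrete, the following minimal Python sketch (our illustration, not the paper's released code) shows how a reward model's scores over candidate trajectories would be used for best-of-N selection in TTS, and how classification accuracy and calibration, measured here via expected calibration error, can be computed from those scores against ground-truth resolution labels. The function names, the 0.5 decision threshold, and the binning scheme are assumptions made for illustration.

```python
import numpy as np

def best_of_n(scores):
    """TTS selection: return the index of the trajectory with the highest reward-model score."""
    return int(np.argmax(scores))

def classification_accuracy(scores, labels, threshold=0.5):
    """Fraction of trajectories whose success (1) or failure (0) the verifier predicts correctly,
    thresholding scores assumed to lie in [0, 1]."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

def expected_calibration_error(scores, labels, n_bins=10):
    """ECE: weighted average gap between mean predicted score and empirical success rate per bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the right edge only in the final bin so that a score of 1.0 is counted.
        mask = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return float(ece)

# Toy usage: four candidate trajectories for one issue, with verifier scores and true outcomes.
scores = [0.82, 0.35, 0.91, 0.10]
labels = [1, 0, 1, 0]
print(best_of_n(scores))                       # index of the trajectory TTS would submit
print(classification_accuracy(scores, labels))  # accuracy of the verifier as a classifier
print(expected_calibration_error(scores, labels))
```

Two verifiers can rank trajectories identically (same `best_of_n` choices, hence the same TTS result) while differing sharply in accuracy and ECE, which is the sense in which strong TTS performance need not transfer to RL reward signals.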