Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified, demonstrating that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
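To make the Fix Rate idea concrete, the sketch below computes it as the fraction of an instance's reference tests that pass after the agent's patch is applied. This is an assumed formulation for illustration only; the benchmark's exact definition may weight or group tests differently.

```python
# Hypothetical sketch of a Fix-Rate-style metric, assuming it is the
# fraction of an instance's reference tests that pass post-patch.
def fix_rate(test_results: dict[str, bool]) -> float:
    """test_results maps each reference test id to True (pass) or False (fail)."""
    if not test_results:
        return 0.0
    return sum(test_results.values()) / len(test_results)

# Example: 612 of 874 reference tests pass after the patch -> Fix Rate ~ 0.70
results = {f"test_{i}": i < 612 for i in range(874)}
print(f"Fix Rate: {fix_rate(results):.2f}")
```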