PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, which makes them particularly challenging to detect and fix. While recent advances in software engineering agents have shown promise in automated bug fixing, existing benchmarks focus primarily on functional correctness and do not evaluate agents' ability to identify and resolve non-functional issues such as performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing the execution metrics collected for the developer's fix with those collected for the agent's fix. Each task in PerfBench is derived from an actual developer fix linked to a performance-related issue and is verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that giving the agent proper instructions to benchmark its changes, together with tooling for processing benchmark output, improves agent performance significantly, although substantial room for improvement remains. PerfBench provides a challenging test set for advancing the capabilities of agents in fixing performance issues.
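
To make the idea of validating a fix by comparing execution metrics more concrete, the following is a minimal sketch of what such a comparison could look like. It is not the PerfBench harness: the `Metrics` fields, the `tolerance` parameter, and the pass criterion (the agent must recover most of the developer's improvement on every metric the developer improved) are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Execution metrics from one benchmark run (hypothetical fields)."""
    mean_time_ns: float
    allocated_bytes: float


def relative_improvement(before: float, after: float) -> float:
    """Fraction by which a metric shrank; positive means faster or leaner."""
    if before <= 0:
        return 0.0
    return (before - after) / before


def agent_fix_passes(baseline: Metrics, developer: Metrics, agent: Metrics,
                     tolerance: float = 0.1) -> bool:
    """Assumed criterion, not the paper's: for every metric the developer
    fix improved, the agent fix must reach at least (1 - tolerance) of
    that improvement. Returns False if the developer improved nothing."""
    checked_any = False
    for field in ("mean_time_ns", "allocated_bytes"):
        base = getattr(baseline, field)
        dev_gain = relative_improvement(base, getattr(developer, field))
        if dev_gain <= 0:
            continue  # the developer fix did not target this metric
        agent_gain = relative_improvement(base, getattr(agent, field))
        if agent_gain < (1 - tolerance) * dev_gain:
            return False
        checked_any = True
    return checked_any


if __name__ == "__main__":
    baseline = Metrics(mean_time_ns=100.0, allocated_bytes=1000.0)
    developer = Metrics(mean_time_ns=50.0, allocated_bytes=500.0)
    agent = Metrics(mean_time_ns=52.0, allocated_bytes=520.0)
    print(agent_fix_passes(baseline, developer, agent))
```

Under these assumptions, all three runs would need to execute the same benchmark on the same machine for the comparison to be meaningful; the tolerance is there to absorb measurement noise.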