A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.

