The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding; it encompasses activities such as the maintenance and evolution of a project. In this context, the concept of LLM agents, which utilize LLMs as reasoning engines to invoke external tools autonomously, has gained traction. But is an LLM agent the same as an AI software engineer? In this paper, we seek to understand this question by developing a Unified Software Engineering agent, or USEagent. Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities. This gives the agent the promise of handling complex scenarios in software development, such as fixing an incomplete patch, adding new features, or taking over code written by others. We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising diverse tasks such as coding, testing, and patching. USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD. In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent. Gaps remain in the capabilities of USEagent for certain coding tasks, which provide hints for further developing the AI Software Engineer of the future.