Defects4C: Benchmarking Large Language Model Repair Capability with C/C++ Bugs

Automated Program Repair (APR) plays a critical role in enhancing the quality and reliability of software systems. While substantial progress has been made in Java-based APR, largely facilitated by benchmarks like Defects4J, there remains a significant gap in research on C/C++ program repair, despite the widespread use of C/C++ and the prevalence of associated vulnerabilities. This gap is primarily due to the lack of high-quality, open-source benchmarks tailored for C/C++. To address this issue, we introduce Defects4C, a comprehensive and executable benchmark specifically designed for C/C++ program repair. Our dataset is constructed from real-world C/C++ repositories and includes a large collection of bug-relevant commits (9M in total), 248 high-quality buggy functions, and 102 vulnerable functions, all paired with test cases for reproduction. These resources enable rigorous evaluation of repair techniques and support the retraining of learning-based approaches for enhanced performance. Using Defects4C, we conduct a comprehensive empirical study evaluating the effectiveness of 24 state-of-the-art large language models (LLMs) in repairing C/C++ faults. Our findings offer valuable insights into the strengths and limitations of current LLM-based APR techniques in this domain, highlighting both the need for more robust methods and the critical role of Defects4C in advancing future research.
