Cerberus: Multi-Agent Reasoning and Coverage-Guided Exploration for Static Detection of Runtime Errors

In many software development scenarios, it is desirable to detect runtime errors and exceptions in code snippets without actually executing them. A typical example is checking online code snippets for runtime exceptions before integrating them into a codebase. In this paper, we propose Cerberus, a novel predictive, execution-free, coverage-guided testing framework. Cerberus uses LLMs to generate inputs that trigger runtime errors and to predict code coverage and detect errors without executing the code. Its feedback loop has two phases: Cerberus first aims both to increase code coverage and to detect runtime errors, then, once coverage reaches 100% or its attainable maximum, shifts to focusing solely on detecting runtime errors. This split enables it to perform better than prompting the LLMs to pursue both goals at once. Our empirical evaluation demonstrates that Cerberus outperforms conventional and learning-based testing frameworks on both complete and incomplete code snippets by generating high-coverage test cases more efficiently, which leads to the discovery of more runtime errors.
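The two-phase feedback loop can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Prediction`, `two_phase_loop`, `toy_generate`, and `toy_predict` are hypothetical, the two LLM roles (input generation and coverage/error prediction) are replaced by deterministic stand-ins for a four-line toy snippet, and "coverage reaches its maximum" is approximated, as an assumption, by a plateau of `patience` iterations without newly covered lines.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Set, Tuple


@dataclass
class Prediction:
    """Hypothetical result of an execution-free prediction for one input."""
    covered_lines: Set[int]      # lines predicted to run for this input
    error: Optional[str] = None  # predicted runtime error, if any


def two_phase_loop(
    total_lines: int,
    generate_input: Callable[[int, Set[int]], Any],  # stand-in for the LLM generator
    predict: Callable[[Any], Prediction],            # stand-in for the LLM predictor
    max_iters: int = 20,
    patience: int = 3,  # assumption: a coverage plateau counts as "maximum reached"
) -> Tuple[Set[int], List[Tuple[Any, str]], int]:
    covered: Set[int] = set()
    errors: List[Tuple[Any, str]] = []
    phase, stalled = 1, 0
    for _ in range(max_iters):
        candidate = generate_input(phase, covered)
        pred = predict(candidate)
        # Errors are recorded in both phases; duplicates are skipped.
        if pred.error is not None and (candidate, pred.error) not in errors:
            errors.append((candidate, pred.error))
        if phase == 1:
            # Phase 1: coverage feedback drives the next input.
            newly_covered = pred.covered_lines - covered
            covered |= pred.covered_lines
            stalled = 0 if newly_covered else stalled + 1
            if len(covered) == total_lines or stalled >= patience:
                phase = 2  # Phase 2: focus only on triggering errors.
    return covered, errors, phase


# Toy snippet under analysis (never executed by the loop):
#   1: y = 10 / x
#   2: if x > 5:
#   3:     return y
#   4: return -y
def toy_predict(x: int) -> Prediction:
    if x == 0:
        return Prediction({1}, "ZeroDivisionError")
    return Prediction({1, 2, 3} if x > 5 else {1, 2, 4})


def toy_generate(phase: int, covered: Set[int]) -> int:
    if phase == 1:
        return 10 if 3 not in covered else 1  # target an uncovered branch
    return 0  # phase 2: aim for an error-triggering input


covered, errors, phase = two_phase_loop(4, toy_generate, toy_predict, max_iters=5)
print(sorted(covered), errors, phase)
```

In this sketch the generator receives the current phase and the covered set as feedback, so the same loop serves both objectives; a real system would put that feedback into the LLM prompt instead.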
