While large language models have made significant strides in code generation, the pass rate of generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations on HumanEval and a 97.6% repair success rate on HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
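To make the hierarchical decomposition and bottom-up repair loop concrete, the following is a minimal sketch under stated assumptions, not the paper's actual implementation. The names `SubFunction`, `debug_bottom_up`, `run_tests`, `repair`, and `MAX_ATTEMPTS` are hypothetical placeholders: `run_tests` stands in for the LLM-simulated Python executor that traces execution and checks a subfunction's tests, and `repair` stands in for the LLM call that proposes a fix.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 3  # assumed per-subfunction debugging budget


@dataclass
class SubFunction:
    """A node in the hierarchical decomposition of a buggy program."""
    name: str
    code: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(
    node: SubFunction,
    run_tests: Callable[[str], bool],  # stand-in for the LLM-simulated executor
    repair: Callable[[str], str],      # stand-in for the LLM repair step
) -> SubFunction:
    """Repair leaf subfunctions first, then propagate fixes upward."""
    # Resolve finer-granularity bugs before the coarser logic that uses them.
    node.children = [debug_bottom_up(c, run_tests, repair) for c in node.children]
    for _ in range(MAX_ATTEMPTS):
        if run_tests(node.code):       # simulated execution passes this node's tests
            break
        node.code = repair(node.code)  # ask the model for a fixed version
    return node
```

The recursion visits children before their parent, which reflects the bottom-up ordering described above: each subfunction is verified in isolation before the composite logic that depends on it is debugged.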