Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Existing benchmarks, however, involve simple programs and fail to represent real-world complexities such as inter- or intra-procedural dependencies, core or third-party API calls, highly nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, this paper proposes RE2-Bench, a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. RE2-Bench leverages static and dynamic program analysis to automatically serialize and deserialize compound, complex, and custom types in real-world code, going far beyond the primitive-only settings used in prior work. A key feature of RE2-Bench is that it categorizes each reasoning problem as Easy or Hard via a principled majority-vote mechanism over nine interpretable code complexity metrics, resulting in two well-separated and semantically meaningful difficulty categories suitable for precise calibration of LLM reasoning ability. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks (input prediction and output prediction) using RE2-Bench reveals a significant performance drop from Easy to Hard problems (51.50% for input prediction and 42.15% for output prediction), confirming that prior evaluations substantially overestimate the reasoning capabilities of LLMs.
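
The majority-vote difficulty labeling described above can be illustrated with a minimal sketch. The abstract names neither the nine metrics nor their cut-offs, so every metric name and threshold below is a hypothetical placeholder, and the rule "a metric votes Hard when its value exceeds its threshold" is an assumption made for illustration rather than the benchmark's actual procedure.

```python
from typing import Mapping

# Hypothetical metric names and thresholds (assumptions for illustration only;
# the abstract does not specify which nine metrics or cut-offs are used).
THRESHOLDS: dict[str, float] = {
    "cyclomatic_complexity": 10,
    "max_nesting_depth": 3,
    "lines_of_code": 50,
    "num_call_dependencies": 5,
    "num_api_calls": 5,
    "num_non_primitive_types": 2,
    "num_loops": 3,
    "num_branches": 8,
    "num_variables": 15,
}


def classify_difficulty(
    metrics: Mapping[str, float],
    thresholds: Mapping[str, float] = THRESHOLDS,
) -> str:
    """Label a problem Easy or Hard by majority vote over per-metric votes.

    Each metric casts one Hard vote when its value exceeds its threshold
    (an assumed voting rule). With nine metrics there can be no tie.
    """
    hard_votes = sum(
        1 for name, limit in thresholds.items() if metrics[name] > limit
    )
    return "Hard" if hard_votes > len(thresholds) / 2 else "Easy"


# Example: a small, flat function with primitive inputs.
simple_problem = {name: 0 for name in THRESHOLDS}
print(classify_difficulty(simple_problem))
```

Using an odd number of voters means each problem receives an unambiguous label, which is one plausible reason for voting over nine metrics instead of thresholding a single aggregate score.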
