The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities. Yet existing benchmarks predominantly assess models at a single structural granularity and cover only a limited set of programming languages, obscuring fine-grained capability variations across code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating LLM code generation across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval comprises more than 17,000 training tasks and 1,286 human-annotated, contamination-controlled test instances. We further develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization (GRPO). Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) yields three main findings: (1) a clear difficulty hierarchy, with Line-level tasks the easiest and Class-level tasks the most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.
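To make the four granularity levels and the per-granularity breakdown concrete, the sketch below shows one way an evaluation task and a difficulty-hierarchy summary could be represented. It is a hypothetical illustration only: the `EvalTask` fields, the `Granularity` enum, and the `score_by_granularity` helper are our assumptions for exposition, not M2G-Eval's actual data schema or metric.

```python
# Minimal sketch, assuming a pass/fail outcome per task; names are illustrative.
from dataclasses import dataclass
from enum import Enum
from collections import defaultdict
from typing import Dict, List


class Granularity(Enum):
    CLASS = "class"        # generate an entire class
    FUNCTION = "function"  # generate a single function
    BLOCK = "block"        # complete a code block inside a function
    LINE = "line"          # complete a single line


@dataclass
class EvalTask:
    language: str              # one of the 18 benchmark languages, e.g. "python"
    granularity: Granularity   # structural scope of the generation target
    passed: bool               # whether the model's completion passed the check


def score_by_granularity(results: List[EvalTask]) -> Dict[Granularity, float]:
    """Aggregate pass rates per granularity to expose the difficulty hierarchy."""
    totals, passes = defaultdict(int), defaultdict(int)
    for task in results:
        totals[task.granularity] += 1
        passes[task.granularity] += int(task.passed)
    return {g: passes[g] / totals[g] for g in totals}
```

Under this sketch, the reported difficulty hierarchy would appear as a monotone drop in pass rate from `Granularity.LINE` down to `Granularity.CLASS`.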