Large language models (LLMs) have made significant strides in code generation, demonstrating impressive capabilities in synthesizing code snippets from natural language instructions. However, ensuring that LLMs produce factually accurate responses about programming concepts and technical implementations remains a critical challenge. Most prior code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark for evaluating the factual accuracy of code LLMs in answering code-related questions. It contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. We further construct CodeSimpleQA-Instruct, a large-scale instruction corpus of 66M samples, and develop a post-training framework that combines supervised fine-tuning with reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier models struggle with code factuality. Our proposed framework yields substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.