Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPS and LiveCodeBench) consist largely of medium-difficulty problems that pose little challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC) World Finals and the International Olympiad in Informatics (IOI) from 2010 to 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. We further propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities; the results indicate that LLMs' self-recognition ability is not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs still have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are publicly available at https://github.com/Humanity-s-Last-Code-Exam/HLCE.
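For reference, pass@1 and its generalization pass@k are typically computed with the standard unbiased estimator from the Codex paper; the sketch below assumes this convention, and the exact evaluation script used by HLCE may differ.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generated solutions of
    which c are correct, passes all tests. (Assumed convention; not
    necessarily the exact script used by HLCE.)"""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: if 2 of 16 sampled solutions pass the judge, pass@1 = 0.125
print(pass_at_k(n=16, c=2, k=1))
```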