QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, the functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64, respectively. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.
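
The abstract does not spell out the mechanics of the self-evolving prompt optimization, so the following is only a minimal sketch of one plausible reading: a self-debugging loop produces traces, and insights extracted from those traces are appended to the prompt for later runs. All names here (`Attempt`, `self_debug`, `evolve_prompt`, `run_epoch`) and the callables `generate`, `check`, and `extract_insights` are hypothetical placeholders for an LLM call, an assemble-and-test harness, and an LLM-based summarizer; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One round of a self-debugging trace (hypothetical structure)."""
    assembly: str
    passed: bool
    feedback: str


def self_debug(
    ir: str,
    prompt: str,
    generate: Callable[[str, str, str], str],    # assumed: (prompt, ir, feedback) -> assembly
    check: Callable[[str], Tuple[bool, str]],    # assumed: assembly -> (passed, feedback)
    max_rounds: int = 3,
) -> List[Attempt]:
    """Generate assembly, test it, and retry with the test feedback."""
    trace: List[Attempt] = []
    feedback = ""
    for _ in range(max_rounds):
        asm = generate(prompt, ir, feedback)
        passed, feedback = check(asm)
        trace.append(Attempt(asm, passed, feedback))
        if passed:
            break
    return trace


def evolve_prompt(
    prompt: str,
    traces: List[List[Attempt]],
    extract_insights: Callable[[List[List[Attempt]]], List[str]],
) -> str:
    """Append insights from earlier traces that the prompt does not already contain."""
    new = [i for i in extract_insights(traces) if i not in prompt]
    if not new:
        return prompt
    return prompt + "\n" + "\n".join(f"- {i}" for i in new)


def run_epoch(programs, prompt, generate, check, extract_insights):
    """Run self-debugging on every IR program, then evolve the prompt.

    Returns the evolved prompt and the fraction of programs whose final
    attempt passed the check.
    """
    traces = [self_debug(ir, prompt, generate, check) for ir in programs]
    solved = sum(t[-1].passed for t in traces)
    return evolve_prompt(prompt, traces, extract_insights), solved / len(programs)
```

In a real setting, `check` would presumably assemble the output, run it against the benchmark's test cases, and return the failure log as feedback, while `run_epoch` would be repeated over a training split so that the prompt accumulates lessons before evaluation on held-out programs.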
