Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates on low-resource programming languages such as Fortran and emerging frameworks such as CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline built around a dual-LLM Questioner-Solver design that incorporates external knowledge from compiler diagnostics and runtime feedback. Beyond traditional source-target code pairs, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA translation, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields substantial improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics such as compilation success.
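To make the dual-LLM loop concrete, the sketch below shows one plausible shape of a Questioner-Solver iteration with compiler feedback, as described above. It is a minimal illustration, not the paper's released code: `llm_chat` is a hypothetical wrapper around any chat-completion API, `g++` stands in for the target-language toolchain (the CUDA setting would use `nvcc`), and `MAX_TURNS` and the prompts are illustrative.

```python
# Minimal sketch of the dual-LLM Questioner-Solver loop with compiler feedback.
# Assumptions (not from the paper): llm_chat() is a hypothetical chat wrapper,
# g++ stands in for the target toolchain, and all prompts are illustrative.
import os
import subprocess
import tempfile

MAX_TURNS = 4  # illustrative refinement budget per sample


def llm_chat(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical wrapper; swap in an actual LLM client here."""
    raise NotImplementedError


def compile_cpp(source: str) -> tuple[bool, str]:
    """Compile a candidate C++ translation; return (ok, diagnostics)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        with open(src, "w") as f:
            f.write(source)
        proc = subprocess.run(
            ["g++", "-std=c++17", "-c", src, "-o", os.path.join(tmp, "candidate.o")],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stderr


def questioner_solver(fortran_code: str) -> list[dict]:
    """Produce one multi-turn refinement dialogue for a Fortran->C++ sample."""
    # Questioner: turn the raw snippet into a translation task plus a unit test.
    task = llm_chat(
        "You write code-translation tasks with unit tests.",
        [{"role": "user", "content": fortran_code}],
    )
    dialogue = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        # Solver: propose a candidate translation given the dialogue so far.
        candidate = llm_chat("You translate Fortran to C++.", dialogue)
        dialogue.append({"role": "assistant", "content": candidate})
        ok, diagnostics = compile_cpp(candidate)
        if ok:
            break  # a full pipeline would also execute the unit test here
        # External knowledge: feed compiler errors back as the next user turn,
        # so the recorded dialogue captures the refinement reasoning.
        dialogue.append(
            {"role": "user", "content": f"Fix these compiler errors:\n{diagnostics}"}
        )
    return dialogue
```

In this sketch, each failed compilation extends the dialogue with the verbatim diagnostics, so the resulting multi-turn trace is exactly the kind of refinement data the abstract describes; running the generated unit test would supply the analogous runtime feedback signal.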