Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models

The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages (PLs) and how such relationships can be used to optimize the training and inference of code LLMs. In this work, we investigate two fundamental questions: (1) What are the deep linguistic relationships among PLs? and (2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of PLs. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definition, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples across multiple languages. By embedding these semantically parallel code snippets from 19 languages, we construct a similarity matrix and perform hierarchical clustering to uncover inherent language relationships. Our analysis reveals clear hierarchical structures among programming languages. Closely related languages form well-defined clusters (e.g., C, C++, Java, and Swift group together), while Go emerges as a central language with the highest cross-language similarity. Building on the uncovered language families, we propose three strategies to enhance multilingual LLM training: transfer learning across linguistically related languages, linguistic proximity-guided curriculum learning, and centroid-based intermediary code translation. Experiments on four code intelligence tasks demonstrate that our methods significantly improve multilingual LLM performance. This work offers a universal perspective on programming languages and advances more effective strategies for multilingual code LLM training.
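
The pipeline described in the abstract (embed parallel snippets, build a similarity matrix, cluster hierarchically, pick a central language) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the abstract does not name the embedding model, the similarity measure, or the linkage criterion, so the sketch uses cosine similarity and average linkage. The `EMBEDDINGS` table and all function names are hypothetical, and the vectors are made-up toy values, not results from the paper.

```python
import math

# Hypothetical toy embeddings: one vector per language, standing in for the
# averaged embedding of that language's feature-aligned snippets.
# The numbers are invented for illustration only.
EMBEDDINGS = {
    "C":      [0.90, 0.10, 0.20],
    "C++":    [0.85, 0.15, 0.25],
    "Java":   [0.80, 0.30, 0.20],
    "Go":     [0.60, 0.50, 0.40],
    "Python": [0.20, 0.90, 0.30],
    "Ruby":   [0.15, 0.85, 0.40],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity as a nested dict: sim[a][b]."""
    return {a: {b: cosine(embeddings[a], embeddings[b]) for b in embeddings}
            for a in embeddings}


def average_linkage(sim):
    """Agglomerative clustering with average linkage (an assumed criterion).

    Returns the merge history as (cluster_a, cluster_b, similarity) tuples,
    from the most similar pair to the final merge.
    """
    clusters = [(name,) for name in sim]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [sim[a][b] for a in clusters[i] for b in clusters[j]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


def central_language(sim):
    """Language with the highest mean similarity to all other languages."""
    def mean_similarity(lang):
        others = [sim[lang][other] for other in sim if other != lang]
        return sum(others) / len(others)
    return max(sim, key=mean_similarity)


if __name__ == "__main__":
    sim = similarity_matrix(EMBEDDINGS)
    for left, right, score in average_linkage(sim):
        print(f"merge {left} + {right} at similarity {score:.3f}")
    print("central language:", central_language(sim))
```

With these toy vectors the first merge joins C and C++, and Go is selected as the central language. Both outcomes follow from how the vectors were chosen and say nothing about the paper's actual measurements. A language picked this way is the kind of candidate the abstract's centroid-based intermediary translation would route through.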
