Effective code retrieval is indispensable, and searching for code in a hybrid mode, using queries that combine natural language and code snippets, has become an important paradigm. Nevertheless, it remains unclear whether existing approaches can effectively leverage such hybrid queries, particularly in cross-language contexts. We conduct a comprehensive empirical study of representative code models and reveal three challenges: (1) insufficient semantic understanding; (2) inefficient fusion in hybrid code retrieval; and (3) weak generalization in cross-language scenarios. To address these challenges, we propose UniCoR, a novel self-supervised framework designed to learn Unified Code Representations that are both unified and robust. First, we design a multi-perspective supervised contrastive learning module to enhance semantic understanding and modality fusion. It aligns representations from multiple perspectives, including code-to-code, natural language-to-code, and natural language-to-natural language, forcing the model to capture the shared semantic essence across modalities. Second, we introduce a representation distribution consistency learning module to improve cross-language generalization; it explicitly aligns the feature distributions of different programming languages, enabling language-agnostic representation learning. Extensive experiments on both an empirical benchmark and a large-scale benchmark show that UniCoR outperforms all baseline models, achieving an average improvement of 8.64% in MRR and 11.54% in MAP over the best-performing baseline. Furthermore, UniCoR exhibits stability in hybrid code retrieval and strong generalization capability in cross-language scenarios.
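The abstract does not give implementation details for the multi-perspective contrastive module, but its three alignment perspectives (code-to-code, natural language-to-code, natural language-to-natural language) can be sketched as below. This is a minimal sketch assuming an InfoNCE-style loss with in-batch negatives and equal weighting of the three perspectives; the function names and weighting scheme are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: row i of `anchor` is matched to row i of
    `positive`; all other rows in the batch serve as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def multi_perspective_loss(code_a, code_b, nl_a, nl_b):
    """Combine the three alignment perspectives named in the abstract.
    `code_a`/`code_b` are embeddings of two views of the same code
    snippet (e.g., original and augmented); `nl_a`/`nl_b` likewise
    for its paired natural-language description."""
    l_cc = info_nce(code_a, code_b)  # code <-> code
    l_nc = info_nce(nl_a, code_a)    # natural language <-> code
    l_nn = info_nce(nl_a, nl_b)      # natural language <-> natural language
    return (l_cc + l_nc + l_nn) / 3.0  # equal weighting is an assumption
```

Aligning all three perspectives jointly, rather than only NL-to-code as in standard code search pretraining, is what pushes the encoder toward a single semantic space shared by both modalities.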
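For the representation distribution consistency module, the abstract states only that the feature distributions of different programming languages are explicitly aligned. One standard way to realize such alignment is a maximum mean discrepancy (MMD) penalty; the sketch below assumes an RBF kernel and a biased MMD² estimate, and is purely illustrative, since the divergence actually used by UniCoR is not specified here.

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor,
               sigma: float = 1.0) -> torch.Tensor:
    """RBF kernel matrix between two sets of embeddings."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_loss(feats_lang_a: torch.Tensor, feats_lang_b: torch.Tensor,
             sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between the embedding distributions of two
    programming languages (e.g., a batch of Python embeddings vs. a
    batch of Java embeddings). Driving this toward zero encourages
    language-agnostic representations."""
    k_aa = rbf_kernel(feats_lang_a, feats_lang_a, sigma).mean()
    k_bb = rbf_kernel(feats_lang_b, feats_lang_b, sigma).mean()
    k_ab = rbf_kernel(feats_lang_a, feats_lang_b, sigma).mean()
    return k_aa + k_bb - 2 * k_ab
```

In a combined training objective, a term of this form would be added to the contrastive loss so that the encoder cannot separate languages in feature space, which is the mechanism the abstract credits for cross-language generalization.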