Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically treat source code as a plain sequence of tokens, or inject structural information (e.g., AST and data flow) into sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which predict identifiers and edges between two AST nodes, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities of source code (i.e., code, comment, AST), we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size.
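To make the multi-modal contrastive objective concrete, the sketch below shows a generic symmetric in-batch InfoNCE loss between embeddings of two modalities (e.g., code vs. AST, or code vs. comment). The function name, temperature value, and symmetric formulation are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(code_emb, other_emb, temperature=0.07):
    """Symmetric in-batch InfoNCE loss between two modality encodings.

    code_emb, other_emb: [batch, dim] embeddings of aligned pairs
    (e.g., a code snippet and its AST or comment). Aligned pairs on the
    diagonal are positives; all other in-batch pairs act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    code_emb = F.normalize(code_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # [batch, batch] similarity matrix, scaled by temperature.
    logits = code_emb @ other_emb.t() / temperature
    targets = torch.arange(code_emb.size(0), device=code_emb.device)

    # Contrast in both directions and average; minimizing InfoNCE
    # maximizes a lower bound on the mutual information between modalities.
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    return loss
```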