Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. However, issues remain in their application to software engineering (SE) tasks. First, the majority of pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks addressed with encoder-decoder architectures, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code that is eventually needed for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus of source code paired with natural language descriptions, which severely limits the amount of data available for pre-training. To address these weaknesses, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. To pre-train SPT-Code in a sequence-to-sequence manner, we introduce three pre-training tasks specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, and a natural language description of the code without relying on any bilingual corpus, and to exploit these three sources of information when applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.
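To make the pre-training setup concrete, the sketch below shows one way a sequence-to-sequence training example could be assembled from code alone, in the spirit of the abstract: the encoder input combines code tokens, a linearized representation of the code structure, and a natural-language-like component derived from identifiers rather than from a bilingual corpus, while the decoder is given a denoising target. This is a minimal illustrative sketch, not SPT-Code's actual implementation; the helper names, special tokens, masking ratio, and the use of identifier splitting as the natural language component are all assumptions for illustration.

```python
# Hypothetical sketch: building a seq2seq pre-training example from code only.
# All helpers, special tokens, and the masking scheme are illustrative assumptions.
import random
import re

SEP, MASK = "<sep>", "<mask>"

def split_identifier(name):
    """Split a camelCase / snake_case identifier into lowercase word tokens,
    yielding a natural-language-like description derived from the code itself."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

def build_encoder_input(code_tokens, linearized_ast, method_name):
    """Concatenate the three information sources named in the abstract:
    code tokens, code structure (a linearized AST here), and a code-derived
    natural language component, separated by SEP tokens."""
    nl_tokens = split_identifier(method_name)
    return code_tokens + [SEP] + linearized_ast + [SEP] + nl_tokens

def mask_span(tokens, ratio=0.5, seed=0):
    """Replace a contiguous span of code tokens with MASK; the decoder would be
    trained to regenerate the masked span (an assumed denoising-style objective)."""
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * ratio))
    start = rng.randrange(0, len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return corrupted, target

if __name__ == "__main__":
    code = ["def", "add_item", "(", "self", ",", "item", ")", ":",
            "self", ".", "items", ".", "append", "(", "item", ")"]
    ast = ["FunctionDef", "arguments", "Expr", "Call", "Attribute"]  # toy linearized AST
    corrupted_code, target = mask_span(code)
    encoder_input = build_encoder_input(corrupted_code, ast, "add_item")
    print("encoder input :", encoder_input)
    print("decoder target:", target)
```

Because the natural language component is recovered from identifier names in the code itself, such an example can be built from any unlabeled code corpus, which is the property the abstract highlights when it notes that no bilingual corpus is required.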