Unlike natural language, source code understanding is influenced by grammatical relationships between tokens regardless of their identifier names. Graph representations of source code, such as the Abstract Syntax Tree (AST), can capture relationships between tokens that are not evident from the source code alone. We propose a novel method, GN-Transformer, to learn end-to-end on a fused sequence and graph modality we call Syntax-Code-Graph (SCG). GN-Transformer expands on the Graph Networks (GN) framework using a self-attention mechanism. An SCG is the result of early fusion between a source code snippet and its AST representation. We perform experiments on the structure of the SCG, an ablation study on the model design, and a study of the hyper-parameters, concluding that the performance advantage stems from the fused representation. The proposed method achieves state-of-the-art performance on two code summarization datasets and across three automatic code summarization metrics (BLEU, METEOR, ROUGE-L). We further evaluate the human-perceived quality of our model and previous work with an expert-user study. Our model outperforms the state-of-the-art in human-perceived quality and accuracy.