Graph neural networks have been shown to produce impressive results on a wide range of software engineering tasks when applied to source code. Existing techniques, however, still have two issues: (1) they struggle with long-term dependencies, and (2) they treat different code components as equals when they should not. To address these issues, we propose a method for representing code as a hierarchy (Code Hierarchy), in which different code components are represented separately at various levels of granularity. To process each level of the representation, we design a novel network architecture, HIRGAST, which combines the strengths of Heterogeneous Graph Transformer Networks and Tree-based Convolutional Neural Networks to learn from Abstract Syntax Trees enriched with code dependency information. We also propose a novel pretraining objective, Missing Subtree Prediction, to complement our Code Hierarchy. The evaluation results show that our method significantly outperforms the baselines on three downstream tasks: any-code completion, code classification, and code clone detection.
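To make the two central ideas of the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Python source and the standard `ast` module, and it assumes a two-level hierarchy (function level and statement level) purely for illustration; the granularity levels actually used by Code Hierarchy, and the HIRGAST network itself, are not reproduced here. The names `split_levels`, `mask_random_subtree`, and `MASK_NAME` are hypothetical. The second function shows one plausible way to build a training pair for a Missing Subtree Prediction style objective: a statement subtree is replaced by a placeholder, and the removed subtree becomes the prediction target.

```python
import ast
import random

# Hypothetical placeholder identifier marking where a subtree was removed.
MASK_NAME = "MASKED_SUBTREE"


def split_levels(source):
    """Split a module into two assumed granularity levels.

    Returns the coarse level (function names) and the fine level
    (statement node types inside each function body).
    """
    tree = ast.parse(source)
    coarse = []
    fine = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            coarse.append(node.name)
            fine[node.name] = [type(stmt).__name__ for stmt in node.body]
    return coarse, fine


def mask_random_subtree(source, seed=0):
    """Replace one statement subtree with a placeholder.

    Returns (masked_source, target_source), where target_source is the
    text of the removed subtree that a model would be asked to predict.
    Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    rng = random.Random(seed)
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef) and n.body]
    func = rng.choice(funcs)
    idx = rng.randrange(len(func.body))
    target = func.body[idx]
    # Swap the chosen statement for a bare placeholder expression.
    func.body[idx] = ast.Expr(value=ast.Name(id=MASK_NAME, ctx=ast.Load()))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), ast.unparse(target)
```

In this sketch the coarse level would be handled by one model pass and each function body by another, which is one way to keep any single graph small; whether this matches the paper's design is an assumption.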