Improving the Robustness to Data Inconsistency between Training and Testing for Code Completion by Hierarchical Language Model

In software engineering, applying language models to the token sequence of source code is the state-of-the-art approach to building a code recommendation system. The syntax tree of source code is hierarchical, and ignoring this tree structure degrades model performance. Current LSTM models handle sequential data, and their performance drops sharply when noisy, unseen data is spread throughout the test set. Because naming in code is unconstrained, a model trained on one project commonly encounters many unknown words in another project. If we map many unseen words to UNK, as is done in natural language processing, the number of UNK tokens becomes much greater than the total count of the most frequent words. In an extreme case, predicting UNK everywhere may achieve very high prediction accuracy. Such a solution therefore cannot reflect the true performance of a model when it encounters noisy, unseen data. In this paper, we mark only a small number of rare words as UNK and report the prediction performance of models under in-project and cross-project evaluation. We propose a novel Hierarchical Language Model (HLM) that improves the robustness of the LSTM model to inconsistency between the training and testing data distributions. HLM takes the hierarchical structure of the code tree into account when predicting code: it uses a BiLSTM to generate embeddings for sub-trees according to their hierarchy and collects the embeddings of the sub-trees in context to predict the next code token. Experiments on in-project and cross-project data sets indicate that HLM handles the data inconsistency between training and testing better than the state-of-the-art LSTM model, achieving an average improvement of 11.2% in prediction accuracy.
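The UNK argument in the abstract can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the paper's evaluation code: the helper names (`build_vocab`, `encode`, `always_unk_accuracy`), the `<UNK>` marker string, and the toy token lists are all hypothetical. It maps every test token missing from the training vocabulary to UNK and then scores a trivial baseline that predicts UNK at every position.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical marker string; the paper only calls it UNK


def build_vocab(train_tokens, min_count):
    """Keep tokens seen at least `min_count` times in the training data."""
    counts = Counter(train_tokens)
    return {tok for tok, c in counts.items() if c >= min_count}


def encode(tokens, vocab):
    """Replace every out-of-vocabulary token with UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


def always_unk_accuracy(tokens, vocab):
    """Accuracy of a baseline that predicts UNK at every position."""
    encoded = encode(tokens, vocab)
    return sum(tok == UNK for tok in encoded) / len(encoded)


# Toy "projects": identifiers differ, only punctuation and operators overlap.
train = "int count = 0 ; int total = 0 ; count = count + 1 ;".split()
test = "float ratio = hits / misses ; ratio = ratio * scale ;".split()

vocab = build_vocab(train, min_count=1)
print(f"always-UNK accuracy on the other project: "
      f"{always_unk_accuracy(test, vocab):.2f}")
```

In this toy setting most test tokens are identifiers that never occur in the training project, so the always-UNK baseline scores well without learning anything about code. This is the distortion the abstract describes, and it motivates marking only a small number of rare words as UNK.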


