Identifying vulnerable code is a precautionary measure against software security breaches. Experts have spent tedious effort building static analyzers, yet insecure patterns can hardly be enumerated in full. This work explores a deep learning approach that automatically learns insecure patterns from code corpora. Because parsing code naturally yields graph structures, we develop a novel graph neural network (GNN) that exploits both the semantic context and the structural regularity of a program to improve prediction performance. Compared with a generic GNN, our enhancements include a synthesis of multiple representations learned from the several parsed graphs of a program, and a new training loss metric that leverages the fine granularity of labeling. Our model outperforms multiple text-, image-, and graph-based approaches on two real-world datasets.
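As a rough illustration of how parsing turns code into graphs, the sketch below builds two edge sets over the same nodes from one snippet. It is not the model or the parser described here: the abstract does not name the graphs used, so the choice of Python's standard `ast` module, the parent-to-child and next-sibling edge types, and the function name `parse_to_graphs` are all assumptions made for the example.

```python
import ast


def parse_to_graphs(source: str):
    """Parse a Python snippet and return node labels plus two edge lists.

    Hypothetical helper for illustration only. The two edge lists stand in
    for "several parsed graphs of a program"; the actual graphs used by the
    model are not specified in the abstract.
    """
    labels = []         # node type name, indexed by node id
    child_edges = []    # (parent, child) edges of the syntax tree
    sibling_edges = []  # (child, next child) edges under the same parent

    def visit(node, parent_idx):
        # Assign a fresh index on every visit, because some AST nodes
        # (for example Load contexts) can be shared objects.
        idx = len(labels)
        labels.append(type(node).__name__)
        if parent_idx is not None:
            child_edges.append((parent_idx, idx))
        previous = None
        for child in ast.iter_child_nodes(node):
            current = visit(child, idx)
            if previous is not None:
                sibling_edges.append((previous, current))
            previous = current
        return idx

    visit(ast.parse(source), None)
    return labels, child_edges, sibling_edges


# A snippet with an obviously risky pattern: evaluating user input.
labels, child_edges, sibling_edges = parse_to_graphs("x = input()\neval(x)\n")
```

A GNN would typically consume `labels` as node features and each edge list as a separate adjacency structure; how the resulting representations are combined is the part this sketch leaves out.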