Human developers can produce code with cybersecurity bugs. Can emerging 'smart' code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI's Codex and AI21's Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information, both semantically and syntactically, in natural language. We perform a large-scale study of five commercially available, black-box, "off-the-shelf" LLMs, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. Our experiments demonstrate that while the approach has promise (the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the models' performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code.
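For illustration, the sketch below shows one way such a zero-shot repair prompt might be assembled: a comment describing the flaw, the vulnerable code, and a trailing cue where the model is expected to generate the repaired version. The helper name, prompt wording, and example bug are hypothetical assumptions for illustration, not the paper's exact templates.

```python
# Hypothetical sketch of zero-shot prompt construction for vulnerability repair.
# The function name, prompt wording, and example snippet are illustrative
# assumptions, not the study's actual prompt templates.

VULNERABLE_SNIPPET = """\
// CWE-787: potential out-of-bounds write
void copy_name(char *dst, const char *src) {
    strcpy(dst, src);
}
"""

def build_repair_prompt(code: str, bug_hint: str) -> str:
    """Assemble a zero-shot repair prompt: a comment describing the flaw,
    followed by the vulnerable code, ending where the model should begin
    generating the repaired version."""
    return (
        f"/* BUG: {bug_hint} */\n"
        f"{code}\n"
        "/* FIXED VERSION: */\n"
    )

if __name__ == "__main__":
    prompt = build_repair_prompt(
        VULNERABLE_SNIPPET,
        "strcpy may overflow dst; bound the copy to the destination size",
    )
    # The prompt string would then be sent to a code completion LLM and the
    # generated continuation checked for functional and security correctness.
    print(prompt)
```

The design question the study probes is exactly how much and what kind of context (the bug hint, surrounding code, comment phrasing) should appear in such a prompt to coax the model toward a correct, secure fix.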