Large language models (LLMs) are increasingly used in software development, but the depth of their software security expertise remains unclear. This work systematically evaluates the security comprehension of five leading LLMs: GPT-4o-Mini, GPT-5-Mini, Gemini-2.5-Flash, Llama-3.1, and Qwen-2.5, using Bloom's Taxonomy as a framework. We assess six cognitive dimensions: remembering, understanding, applying, analyzing, evaluating, and creating. Our methodology integrates diverse datasets, including curated multiple-choice questions, vulnerable code snippets (SALLM), course assessments from an Introduction to Software Security course, real-world case studies (XBOW), and project-based creation tasks from a Secure Software Engineering course. Results show that while LLMs perform well on lower-level cognitive tasks such as recalling facts and identifying known vulnerabilities, their performance degrades significantly on higher-order tasks that require reasoning, architectural evaluation, and secure system creation. Beyond reporting aggregate accuracy, we introduce a software security knowledge boundary that identifies the highest cognitive level at which a model consistently maintains reliable performance. In addition, we identify 51 recurring misconception patterns exhibited by LLMs across Bloom's levels.
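To make the knowledge-boundary notion concrete, the sketch below shows one plausible way such a boundary could be computed from per-level scores. It is a minimal illustration, not the paper's actual procedure: the 0.8 reliability threshold, the `knowledge_boundary` function name, and the rule that every level up to the boundary must meet the threshold are all assumptions introduced here, and the example accuracies are hypothetical.

```python
from typing import Dict, Optional

# Bloom's Taxonomy levels in ascending cognitive order.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]

def knowledge_boundary(accuracy: Dict[str, float],
                       threshold: float = 0.8) -> Optional[str]:
    """Return the highest Bloom's level at which a model is still
    reliable, requiring every level up to and including it to meet
    the threshold. (Assumed criterion; the paper's exact definition
    may differ.)"""
    boundary = None
    for level in BLOOM_LEVELS:
        if accuracy.get(level, 0.0) >= threshold:
            boundary = level
        else:
            break  # performance degrades; stop at the last reliable level
    return boundary

# Hypothetical per-level accuracies for one model (illustrative only).
scores = {"remembering": 0.95, "understanding": 0.90, "applying": 0.82,
          "analyzing": 0.71, "evaluating": 0.58, "creating": 0.40}
print(knowledge_boundary(scores))  # -> "applying"
```

Under this assumed criterion, a model that recalls facts and identifies known vulnerabilities accurately but falters at analysis would have its boundary at "applying", matching the degradation pattern the abstract reports for higher-order tasks.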