Large language models (LLMs) are increasingly used in software development, but the depth of their software security expertise remains unclear. This work systematically evaluates the security comprehension of five leading LLMs: GPT-4o-Mini, GPT-5-Mini, Gemini-2.5-Flash, Llama-3.1, and Qwen-2.5, using Bloom's Taxonomy as a framework. We assess six cognitive dimensions: remembering, understanding, applying, analyzing, evaluating, and creating. Our methodology integrates diverse datasets, including curated multiple-choice questions, vulnerable code snippets (SALLM), course assessments from an Introduction to Software Security course, real-world case studies (XBOW), and project-based creation tasks from a Secure Software Engineering course. Results show that while LLMs perform well on lower-level cognitive tasks such as recalling facts and identifying known vulnerabilities, their performance degrades significantly on higher-order tasks that require reasoning, architectural evaluation, and secure system creation. Beyond reporting aggregate accuracy, we introduce a software security knowledge boundary that identifies the highest cognitive level at which a model consistently maintains reliable performance. In addition, we identify 51 recurring misconception patterns exhibited by LLMs across Bloom's levels.
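To make the knowledge-boundary notion concrete, the sketch below shows one plausible way such a boundary could be computed from per-level scores. It is a minimal illustration, not the paper's actual procedure: the 0.8 reliability threshold, the `knowledge_boundary` function name, and the rule that every level up to the boundary must meet the threshold are all assumptions introduced here, and the example accuracies are hypothetical.

```python
from typing import Dict, Optional

# Bloom's Taxonomy levels in ascending cognitive order.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]

def knowledge_boundary(accuracy: Dict[str, float],
                       threshold: float = 0.8) -> Optional[str]:
    """Return the highest Bloom's level at which a model is still
    reliable, requiring every level up to and including it to meet
    the threshold. (Assumed criterion; the paper's exact definition
    may differ.)"""
    boundary = None
    for level in BLOOM_LEVELS:
        if accuracy.get(level, 0.0) >= threshold:
            boundary = level
        else:
            break  # performance degrades; stop at the last reliable level
    return boundary

# Hypothetical per-level accuracies for one model (illustrative only).
scores = {"remembering": 0.95, "understanding": 0.90, "applying": 0.82,
          "analyzing": 0.71, "evaluating": 0.58, "creating": 0.40}
print(knowledge_boundary(scores))  # -> "applying"
```

Under this assumed criterion, a model that recalls facts and identifies known vulnerabilities accurately but falters at analysis would have its boundary at "applying", matching the degradation pattern the abstract reports for higher-order tasks.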