Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify \emph{nine} common pitfalls that have become (more) relevant with the emergence of LLMs and that can compromise the validity of research involving them. These pitfalls span the entire LLM workflow, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall and that each pitfall appears in multiple papers. Yet only 15.7\% of the pitfalls present in these papers are explicitly discussed, suggesting that the majority remain unrecognized. To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.