Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.
翻译:大语言模型(LLMs)正越来越多地被考虑应用于高影响力工作流,包括学术同行评审。然而,LLMs容易受到文档级隐藏提示注入攻击。在本工作中,我们构建了一个包含约500篇被ICML接收的真实学术论文的数据集,并评估了在这些文档中嵌入隐藏对抗性提示的效果。每篇论文均被注入了四种不同语言但语义相同的指令,并使用LLM进行评审。我们发现,对于英语、日语和中文的注入,提示注入会导致评审分数和接收/拒绝决定发生显著变化,而阿拉伯语注入则几乎不产生任何影响。这些结果凸显了基于LLM的评审系统对文档级提示注入的易感性,并揭示了不同语言间脆弱性的显著差异。