Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for eliciting hallucinations in LLMs, but these attacks often produce unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning, and therefore offer limited insight into how hallucinations arise in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts that elicit LLM hallucinations remains largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA), which elicit hallucinations through realistic prompt modifications that preserve the original meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate the search for realistic hallucination-eliciting attacks as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates than existing methods while incurring almost no semantic equivalence or semantic coherence errors. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
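As a minimal illustration of contribution (i), the attack search can be sketched as a constrained optimization problem of the following form (the symbols below are our own shorthand, not necessarily the paper's notation):

\[
\max_{x'} \;\; \mathcal{H}\big(f(x')\big)
\quad \text{subject to} \quad
\mathrm{SemEq}(x', x) = 1, \qquad \mathrm{Coh}(x') = 1,
\]

where \(x\) is the original prompt, \(x'\) a candidate rewrite, \(f\) the target LLM, \(\mathcal{H}\) a score measuring how strongly the elicited response hallucinates, and \(\mathrm{SemEq}\), \(\mathrm{Coh}\) indicator-style constraints enforcing semantic equivalence to \(x\) and semantic coherence of \(x'\). Because \(f\) is treated as gradient-inaccessible, contribution (ii) explores this feasible set with a query-only (zeroth-order) search that proposes and retains only constraint-satisfying prompts.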