Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses such as ``Sure, here is...'', and consequently suffer from poor convergence, unnatural prompts, and high computational cost. We introduce the Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes the adversarial objective against aligned LLMs. Rather than targeting an exact textual pattern, our approach exploits the semantic representation space, which comprises diverse responses with equivalent harmful meanings. This reformulation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. We propose the Semantic Representation Heuristic Search algorithm, which efficiently generates semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41\% on average across 18 LLMs, including 100\% on 11 of them) while remaining stealthy and efficient. Comprehensive experimental results confirm the overall superiority of the Semantic Representation Attack. The code will be made publicly available.
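To make the shift in objective concrete, the following is a minimal sketch (not the authors' released code) contrasting the exact-match success criterion used by prior attacks with a semantic-representation criterion that accepts any response close to a set of equivalent targets in embedding space. The embedding model, the threshold `tau`, and all function names are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of two attack-success criteria, assuming the
# sentence-transformers package; model name and threshold are
# illustrative choices, not values from the paper.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice


def exact_match_objective(response: str, target: str) -> bool:
    # Prior paradigm: success only if the response begins with one
    # fixed affirmative string, e.g. "Sure, here is".
    return response.startswith(target)


def semantic_objective(response: str, targets: list[str], tau: float = 0.8) -> bool:
    # Semantic-representation view: success if the response is close,
    # in embedding space, to ANY member of a set of semantically
    # equivalent target responses.
    resp_emb = embedder.encode(response, convert_to_tensor=True)
    tgt_embs = embedder.encode(targets, convert_to_tensor=True)
    sims = util.cos_sim(resp_emb, tgt_embs)  # shape: (1, len(targets))
    return bool(sims.max() >= tau)
```

Under this relaxed criterion, many distinct phrasings count as success, which is why a search procedure can converge where exact-string objectives stall; the specific search strategy and threshold used by the paper's algorithm are not reproduced here.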