Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
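As a minimal sketch of the joint objective and decoding rule described above (the notation and weighting scheme here are our own assumptions, not the paper's exact formulation), the continuous suffix embeddings $E = (e_1, \dots, e_m)$ could be optimized as
\[
\min_{E} \; \mathcal{L}(E) \;=\; \underbrace{\mathcal{L}_{\text{restrict}}(E)}_{\text{(i) induce restricted response}} \;+\; \lambda_{r}\, \underbrace{\mathcal{L}_{\text{refusal}}(E)}_{\text{(ii) steer away from refusal direction}} \;+\; \lambda_{c}\, \underbrace{\mathcal{L}_{\text{coh}}(E)}_{\text{(iii) coherence and non-redundancy}},
\]
where $\mathcal{L}_{\text{refusal}}$ might, for instance, penalize cosine similarity between hidden activations and a precomputed refusal direction. The critic-guided decoding step could then map each embedding back to a vocabulary token $v \in \mathcal{V}$ with embedding matrix rows $\mathbf{w}_v$ via
\[
t_i \;=\; \arg\max_{v \in \mathcal{V}} \; \Big[ \alpha \,\cos\big(e_i, \mathbf{w}_v\big) \;+\; (1-\alpha)\,\log p_{\theta}\big(v \mid t_{<i}\big)\Big],
\]
which balances embedding affinity against language-model likelihood, as stated in the abstract; $\alpha$, $\lambda_{r}$, and $\lambda_{c}$ are hypothetical trade-off hyperparameters.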