Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
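As a minimal sketch of the joint objective and decoding rule described above (the notation and weighting scheme here are our own assumptions, not the paper's exact formulation), the continuous suffix embeddings $E = (e_1, \dots, e_m)$ could be optimized as
\[
\min_{E} \; \mathcal{L}(E) \;=\; \underbrace{\mathcal{L}_{\text{restrict}}(E)}_{\text{(i) induce restricted response}} \;+\; \lambda_{r}\, \underbrace{\mathcal{L}_{\text{refusal}}(E)}_{\text{(ii) steer away from refusal direction}} \;+\; \lambda_{c}\, \underbrace{\mathcal{L}_{\text{coh}}(E)}_{\text{(iii) coherence and non-redundancy}},
\]
where $\mathcal{L}_{\text{refusal}}$ might, for instance, penalize cosine similarity between hidden activations and a precomputed refusal direction. The critic-guided decoding step could then map each embedding back to a vocabulary token $v \in \mathcal{V}$ with embedding matrix rows $\mathbf{w}_v$ via
\[
t_i \;=\; \arg\max_{v \in \mathcal{V}} \; \Big[ \alpha \,\cos\big(e_i, \mathbf{w}_v\big) \;+\; (1-\alpha)\,\log p_{\theta}\big(v \mid t_{<i}\big)\Big],
\]
which balances embedding affinity against language-model likelihood, as stated in the abstract; $\alpha$, $\lambda_{r}$, and $\lambda_{c}$ are hypothetical trade-off hyperparameters.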