Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from them. The evolving nature and diversity of these attacks pose significant challenges for defense systems, including (1) adapting to counter emerging attack strategies without costly retraining, and (2) controlling the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into a Retrieval-Augmented Generation pipeline, which infers the underlying malicious user query and the jailbreak strategy employed against the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.