The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
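To make the variational framing concrete, the following is a minimal sketch of the kind of objective such a formulation implies; the symbols here are illustrative assumptions rather than the paper's own notation: $q_\theta$ denotes the attacker LLM's distribution over prompts $x$, $p_0$ a fluency prior (e.g., a base language model), and $J$ a binary black-box signal that a prompt jailbreaks the target.
\begin{align}
  q_\theta^\ast
    &= \arg\min_\theta \,
       \mathrm{KL}\!\left( q_\theta(x) \,\|\, p(x \mid J = 1) \right) \\
    &= \arg\max_\theta \,
       \mathbb{E}_{x \sim q_\theta}\!\left[ \log p(J = 1 \mid x) \right]
       - \mathrm{KL}\!\left( q_\theta(x) \,\|\, p_0(x) \right),
\end{align}
where the second line follows from expanding the KL divergence and dropping the constant $\log p(J = 1)$. Under this reading, the attacker is rewarded for prompts the target judges as successful jailbreaks while staying close to the fluency prior, which is what allows it to sample diverse, fluent prompts without per-query re-optimization.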