Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. In response, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR) with the StrongREJECT framework (arXiv:2402.10260 [cs.CL]) across sequential interaction turns. Through extensive empirical evaluation of state-of-the-art models, including ChatGPT, Llama, and DeepSeek, we uncover significant vulnerabilities: our automated attacks achieve jailbreak success rates of up to 86% for harmful content generation. These findings show that current safety mechanisms remain susceptible to sophisticated multi-turn attacks, underscoring the urgent need for more robust defense strategies.
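To make the multi-turn methodology concrete, the sketch below illustrates the kind of refinement loop and ASR computation the abstract describes. It is a minimal sketch, not the authors' implementation: the helpers `query_attacker`, `query_target`, and `strongreject_score` are hypothetical placeholders standing in for the attacker LLM, the target model, and a StrongREJECT-style judge, and the turn budget and success threshold are assumed values.

```python
# Minimal sketch of a multi-turn jailbreak loop in the style of AutoAdv.
# All model-facing helpers below are hypothetical placeholders, not a real API.

from dataclasses import dataclass


@dataclass
class AttackTurn:
    prompt: str
    response: str
    score: float  # StrongREJECT-style harmfulness score in [0, 1]


def query_attacker(goal: str, history: list[AttackTurn]) -> str:
    """Hypothetical: attacker LLM rewrites the goal into a disguised prompt,
    conditioning on prior failed turns (roleplay, misdirection, etc.)."""
    raise NotImplementedError


def query_target(prompt: str, history: list[AttackTurn]) -> str:
    """Hypothetical: send the adversarial prompt to the target LLM,
    continuing the existing multi-turn conversation."""
    raise NotImplementedError


def strongreject_score(goal: str, response: str) -> float:
    """Hypothetical: judge the response with a StrongREJECT-style evaluator."""
    raise NotImplementedError


def multi_turn_attack(goal: str, max_turns: int = 5, threshold: float = 0.5) -> list[AttackTurn]:
    """Iteratively refine adversarial prompts until the jailbreak succeeds
    (score >= threshold) or the turn budget is exhausted."""
    history: list[AttackTurn] = []
    for _ in range(max_turns):
        prompt = query_attacker(goal, history)    # rewrite, informed by past failures
        response = query_target(prompt, history)  # target sees the full conversation
        score = strongreject_score(goal, response)
        history.append(AttackTurn(prompt, response, score))
        if score >= threshold:                    # jailbreak succeeded on this turn
            break
    return history


def attack_success_rate(histories: list[list[AttackTurn]], threshold: float = 0.5) -> float:
    """ASR: fraction of goals for which any turn crossed the success threshold."""
    successes = sum(any(t.score >= threshold for t in h) for h in histories)
    return successes / max(len(histories), 1)
```

Under these assumptions, ASR is reported per interaction turn by restricting the histories to their first k turns before scoring, which is one plausible way to realize the "across sequential interaction turns" evaluation mentioned above.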