Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. In response, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR) with the StrongREJECT framework (arXiv:2402.10260 [cs.CL]) across sequential interaction turns. Through extensive empirical evaluation of state-of-the-art models, including ChatGPT, Llama, and DeepSeek, we uncover significant vulnerabilities: our automated attacks achieve jailbreak success rates of up to 86% for harmful content generation. These findings show that current safety mechanisms remain susceptible to sophisticated multi-turn attacks, underscoring the urgent need for more robust defense strategies.
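To make the multi-turn methodology concrete, the sketch below illustrates the kind of refinement loop and ASR computation the abstract describes. It is a minimal sketch, not the authors' implementation: the helpers `query_attacker`, `query_target`, and `strongreject_score` are hypothetical placeholders standing in for the attacker LLM, the target model, and a StrongREJECT-style judge, and the turn budget and success threshold are assumed values.

```python
# Minimal sketch of a multi-turn jailbreak loop in the style of AutoAdv.
# All model-facing helpers below are hypothetical placeholders, not a real API.

from dataclasses import dataclass


@dataclass
class AttackTurn:
    prompt: str
    response: str
    score: float  # StrongREJECT-style harmfulness score in [0, 1]


def query_attacker(goal: str, history: list[AttackTurn]) -> str:
    """Hypothetical: attacker LLM rewrites the goal into a disguised prompt,
    conditioning on prior failed turns (roleplay, misdirection, etc.)."""
    raise NotImplementedError


def query_target(prompt: str, history: list[AttackTurn]) -> str:
    """Hypothetical: send the adversarial prompt to the target LLM,
    continuing the existing multi-turn conversation."""
    raise NotImplementedError


def strongreject_score(goal: str, response: str) -> float:
    """Hypothetical: judge the response with a StrongREJECT-style evaluator."""
    raise NotImplementedError


def multi_turn_attack(goal: str, max_turns: int = 5, threshold: float = 0.5) -> list[AttackTurn]:
    """Iteratively refine adversarial prompts until the jailbreak succeeds
    (score >= threshold) or the turn budget is exhausted."""
    history: list[AttackTurn] = []
    for _ in range(max_turns):
        prompt = query_attacker(goal, history)    # rewrite, informed by past failures
        response = query_target(prompt, history)  # target sees the full conversation
        score = strongreject_score(goal, response)
        history.append(AttackTurn(prompt, response, score))
        if score >= threshold:                    # jailbreak succeeded on this turn
            break
    return history


def attack_success_rate(histories: list[list[AttackTurn]], threshold: float = 0.5) -> float:
    """ASR: fraction of goals for which any turn crossed the success threshold."""
    successes = sum(any(t.score >= threshold for t in h) for h in histories)
    return successes / max(len(histories), 1)
```

Under these assumptions, ASR is reported per interaction turn by restricting the histories to their first k turns before scoring, which is one plausible way to realize the "across sequential interaction turns" evaluation mentioned above.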