RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Content moderation pipelines for modern large language models combine static filters, dedicated moderation services, and alignment-tuned base models, yet real-world deployments still exhibit dangerous failure modes. This paper presents RoguePrompt, an automated jailbreak attack that converts a disallowed user query into a self-reconstructing prompt which passes provider moderation while preserving the original harmful intent. RoguePrompt partitions the instruction across two lexical streams, applies nested classical ciphers, and wraps the result in natural-language directives that cause the target model to decode and execute the hidden payload. Our attack assumes only black-box access to the model and to the associated moderation endpoint. We instantiate RoguePrompt against GPT-4o and evaluate it on 2,448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security-relevant outcomes (bypass, reconstruction, and execution), the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution, substantially outperforming five automated jailbreak baselines. We further analyze the behavior of several automated and human-aligned evaluators and show that dual-layer lexical transformations remain effective even when detectors rely on semantic similarity or learned safety rubrics. Our results highlight systematic blind spots in current moderation practice and suggest that robust deployment will require joint reasoning about user intent, decoding workflows, and model-side computation rather than surface-level toxicity alone.
