By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also heightens the risk of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to steer model behavior toward human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Most existing jailbreak methods, however, are tailored to open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and block malicious input and output content, significantly reducing jailbreak threats. In this paper, we reveal that the success of these safety filters hinges on a critical assumption: that malicious content is explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, creating a false sense of security. To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Extensive experiments on benchmark datasets demonstrate that Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, achieving attack success rates of up to 99%. These results expose a fundamental blind spot in existing defenses and call for a rethinking of cross-modal security in MLLM-integrated systems.
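The abstract does not specify how Odysseus's dual steganography is realized, so as a rough illustration of the underlying idea only, the sketch below shows a generic least-significant-bit (LSB) scheme that hides a UTF-8 text payload inside an ordinary RGB image. The functions `embed_text` and `extract_text`, the 32-bit length header, and the LSB channel are all hypothetical choices for this sketch, not the paper's actual method.

```python
# Minimal LSB text-in-image steganography sketch (illustration only;
# NOT the paper's Odysseus dual-steganography scheme).
import numpy as np
from PIL import Image

def embed_text(cover_path: str, payload: str, stego_path: str) -> None:
    """Hide `payload` in the least significant bits of an RGB image."""
    img = np.array(Image.open(cover_path).convert("RGB"))
    data = payload.encode("utf-8")
    # 32-bit big-endian length header, so extraction knows how many
    # bytes to read back.
    header = len(data).to_bytes(4, "big")
    bits = np.unpackbits(np.frombuffer(header + data, dtype=np.uint8))
    flat = img.reshape(-1)
    if bits.size > flat.size:
        raise ValueError("payload too large for this cover image")
    # Overwrite the LSB of the first bits.size channel values.
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits
    # Save losslessly (e.g., PNG) so the LSBs survive on disk.
    Image.fromarray(img).save(stego_path)

def extract_text(stego_path: str) -> str:
    """Recover the payload written by `embed_text`."""
    flat = np.array(Image.open(stego_path).convert("RGB")).reshape(-1)
    n = int.from_bytes(np.packbits(flat[:32] & 1).tobytes(), "big")
    body_bits = flat[32 : 32 + 8 * n] & 1
    return np.packbits(body_bits).tobytes().decode("utf-8")
```

Under this sketch, `embed_text("cover.png", "hidden query", "stego.png")` produces an image that looks unchanged to a casual viewer (and to a filter that only inspects visible text), while `extract_text("stego.png")` recovers the query exactly; a lossless format such as PNG is required, since lossy compression would destroy the LSB payload.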