Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between the textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially for the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To expose this vulnerability, we propose a novel attack called the Prompt-Restricted Multi-modal Attack (PReMA), which manipulates the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs solely by crafting adversarial images, distinguishing it from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations on image inpainting and style-transfer tasks across various models confirm the strong efficacy of PReMA.
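To make the prompt-restricted threat model concrete, the sketch below illustrates one plausible way an attacker could perturb only the input image of an inpainting pipeline while the editing prompt stays benign and untouched. It is a minimal, hypothetical example assuming the Hugging Face diffusers library and a PGD-style perturbation that pushes the image's VAE latent toward a chosen target latent; this proxy objective is an assumption for illustration and is not the actual PReMA loss described in the paper.

```python
# Hypothetical sketch of a prompt-restricted image attack (NOT the paper's PReMA objective):
# the prompt is held fixed, and only the input image is perturbed under an L_inf budget.
import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float32
)
vae = pipe.vae.requires_grad_(False)  # freeze encoder weights; only the image is optimized

def encode(img):
    # img: (1, 3, H, W) tensor scaled to [-1, 1]; returns the scaled latent mean
    return vae.encode(img).latent_dist.mean * vae.config.scaling_factor

def attack_image(clean_img, target_img, eps=8 / 255, step=1 / 255, iters=100):
    """PGD on the input image so its latent drifts toward a target latent,
    while the editing prompt is never modified (prompt-restricted setting)."""
    target_lat = encode(target_img).detach()
    adv = clean_img.clone()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = torch.nn.functional.mse_loss(encode(adv), target_lat)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - step * grad.sign()                        # minimize latent distance
            adv = clean_img + (adv - clean_img).clamp(-eps, eps)  # project to L_inf ball
            adv = adv.clamp(-1, 1)                                # keep a valid image range
    return adv.detach()

# The adversarial image is then passed to the pipeline with an unmodified, benign prompt, e.g.:
# pipe(prompt="restore the background", image=adv_image, mask_image=mask)
```

Because the prompt remains benign, prompt-filtering defenses alone would not flag such an input, which is why the abstract highlights fixed-prompt image-editing applications as the setting most at risk.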