Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of multimodal alignment. However, previous studies have shown that they are vulnerable to maliciously crafted adversarial samples. Despite their recent success, existing attack methods are generally instance-specific and must generate a perturbation for each input sample. In this paper, we reveal that VLP models are also vulnerable to instance-agnostic universal adversarial perturbations (UAPs). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to achieve the attack. Observing that the pivotal multimodal alignment is achieved through contrastive learning, we turn this powerful technique against the models themselves: we employ a malicious version of contrastive learning, built on carefully crafted positive and negative image-text pairs, to train C-PGC so that it fundamentally destroys the alignment relationship learned by VLP models. In addition, C-PGC fully exploits the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples away from their original regions in the VLP model's feature space, substantially strengthening attacks across various victim models and V+L tasks. The GitHub repository is available at https://github.com/ffhibnese/CPGC_VLP_Universal_Attacks.
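To make the high-level idea concrete, below is a minimal, self-contained sketch (not the authors' released code) of how a "malicious" contrastive objective could be used to train a conditional perturbation generator: matched image-text pairs are pushed apart rather than pulled together. The class and function names (PerturbationGenerator, malicious_contrastive_loss), the conditioning vectors, and all hyperparameters are illustrative assumptions; the actual C-PGC architecture and loss are defined in the paper and the repository linked above.

```python
# Conceptual sketch only: a toy universal-perturbation generator trained with an
# "inverted" InfoNCE loss that pushes matched image-text features apart.
# All names and hyperparameters are hypothetical placeholders, not the authors' C-PGC.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerturbationGenerator(nn.Module):
    """Maps a cross-modal conditioning vector to a bounded image perturbation."""

    def __init__(self, cond_dim=512, img_size=224, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.img_size = img_size
        self.fc = nn.Linear(cond_dim, 3 * img_size * img_size)

    def forward(self, cond):
        delta = self.fc(cond).view(-1, 3, self.img_size, self.img_size)
        # tanh keeps the universal perturbation inside an L_inf budget of eps
        return self.eps * torch.tanh(delta)


def malicious_contrastive_loss(adv_img_feat, txt_feat, temperature=0.07):
    """InfoNCE with inverted roles: the matched (diagonal) image-text pairs are
    treated as pairs to be pushed apart, so minimizing this loss breaks alignment."""
    adv_img_feat = F.normalize(adv_img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = adv_img_feat @ txt_feat.t() / temperature            # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)   # diagonal = original match
    # Standard contrastive training minimizes this cross-entropy; the attack maximizes it.
    return -F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy run with random tensors standing in for the victim VLP encoders' outputs.
    gen = PerturbationGenerator()
    cond = torch.randn(4, 512)                               # assumed cross-modal condition
    delta = gen(cond)                                         # candidate universal perturbations
    adv_img_feat = torch.randn(4, 256, requires_grad=True)    # stand-in for f_img(x + delta)
    txt_feat = torch.randn(4, 256)                            # stand-in for f_txt(caption)
    loss = malicious_contrastive_loss(adv_img_feat, txt_feat)
    loss.backward()  # in a real attack, gradients would reach gen through the image encoder
    print(f"malicious contrastive loss: {loss.item():.4f}")
```

In an actual attack, the adversarial features would be produced by feeding the perturbed images through the victim VLP model's image encoder so that gradients flow back into the generator; the sketch only illustrates the inverted contrastive objective.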