The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not robust against adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning on a large dataset, without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can efficiently identify the backdoor trigger and pinpoint the victim samples and labels. To this end, we introduce an image segmentation ``oracle'' that supervises the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) contrasting the knowledge of CLIP and the oracle to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we can rectify the poisoned CLIP model to negate the backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate that our strategy is effective for CLIP-based backdoor defense.
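To make the trigger-identification step concrete, the following is a minimal sketch of the disagreement idea described above: contrast the poisoned CLIP's predictions with an oracle's labels on the same images, and treat a label that absorbs most of the disagreements as the likely backdoor target. It assumes precomputed per-image predictions; the function name, the input format, and the dominance threshold are illustrative assumptions, not the paper's actual algorithm.

```python
from collections import Counter

def flag_suspicious_label(clip_preds, oracle_preds, threshold=0.5):
    """Flag a likely backdoor target label by contrasting CLIP's predictions
    with an oracle's labels on the same images.

    clip_preds, oracle_preds: lists of predicted class names (one per image).
    Returns (suspect_label, victim_indices), or (None, []) if nothing stands out.
    """
    # Indices where the (possibly poisoned) CLIP disagrees with the oracle.
    disagreements = [i for i, (c, o) in enumerate(zip(clip_preds, oracle_preds)) if c != o]
    if not disagreements:
        return None, []

    # A backdoor tends to redirect many disagreeing samples to one target label.
    target_counts = Counter(clip_preds[i] for i in disagreements)
    suspect, count = target_counts.most_common(1)[0]

    # Only report the label if it dominates the disagreements.
    if count / len(disagreements) < threshold:
        return None, []

    victims = [i for i in disagreements if clip_preds[i] == suspect]
    return suspect, victims


if __name__ == "__main__":
    # Toy example: images 2 and 4 are poisoned and redirected to "banana".
    clip_preds   = ["cat", "dog", "banana", "car", "banana", "dog"]
    oracle_preds = ["cat", "dog", "cat",    "car", "truck",  "dog"]
    print(flag_suspicious_label(clip_preds, oracle_preds))
    # -> ('banana', [2, 4])
```

In this toy usage, the flagged label and victim indices would correspond to the affected label and victim samples that the abstract says are collected into a compact fine-tuning set.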