Large pre-trained Vision-Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods have been proposed to align text features with adversarial image features without changing model parameters. However, when facing diverse adversarial attacks, a single learnable text prompt lacks the generalization capacity to align well with all adversarial image features, which ultimately results in overfitting. To address this challenge, we empirically find that increasing the number of learned prompts yields greater robustness improvements than simply extending the length of a single prompt. Building on this observation, we propose an adversarial tuning method named \textbf{Mixture of Adversarial Prompt Tuning (MoAPT)} to enhance the generalization of VLMs against diverse adversarial attacks. MoAPT learns a mixture of text prompts to obtain more robust text features. To further enhance adaptability, we propose a conditional weight router that, conditioned on the adversarial image, predicts the mixture weights of the multiple learned prompts, yielding sample-specific mixture text features that align with different adversarial image features. Extensive experiments across 11 datasets under different settings show that our method achieves better adversarial robustness than state-of-the-art approaches.