Multimodal large language models (MLLMs) have achieved impressive performance across diverse tasks by jointly reasoning over textual and visual inputs. Despite their success, these models remain highly vulnerable to adversarial manipulations, raising concerns about their safety and reliability in deployment. In this work, we first generalize an approach for generating adversarial images within the HuggingFace ecosystem and then introduce SmoothGuard, a lightweight and model-agnostic defense framework that enhances the robustness of MLLMs through randomized noise injection and clustering-based prediction aggregation. Our method perturbs continuous modalities (e.g., images and audio) with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarially influenced predictions. The final answer is selected from the majority cluster, ensuring stable responses even under malicious perturbations. Extensive experiments on POPE, LLaVA-Bench (In-the-Wild), and MM-SafetyBench demonstrate that SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility. Ablation studies further identify an optimal noise range (0.1-0.2) that balances robustness and utility.
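To make the aggregation step concrete, the following is a minimal Python sketch of the described pipeline: Gaussian noise injection, multiple candidate generations, embedding-based clustering, and selection from the majority cluster. It assumes a caller-supplied `answer_fn` (the MLLM wrapped to map a noisy image to a text answer) and `embed_fn` (any sentence embedder); the choice of two clusters and the closest-to-centroid selection are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of SmoothGuard-style randomized aggregation for an MLLM.
# `answer_fn` and `embed_fn` are placeholders: any callable mapping a (noisy)
# image to a text answer, and any sentence-embedding model, will do.

import numpy as np
from sklearn.cluster import KMeans


def smoothguard_answer(image, answer_fn, embed_fn, n_samples=8, sigma=0.1, seed=0):
    """Perturb the image with Gaussian noise, collect candidate answers,
    cluster their embeddings, and return an answer from the majority cluster."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_samples):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)  # additive Gaussian noise
        noisy = np.clip(noisy, 0.0, 1.0)                          # keep pixels in a valid range
        candidates.append(answer_fn(noisy))

    embeddings = np.stack([embed_fn(ans) for ans in candidates])  # one vector per candidate answer

    # Two clusters (an assumption for this sketch): one is expected to hold
    # consistent answers, the other the adversarially influenced outliers.
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(embeddings)
    majority = np.bincount(labels).argmax()
    idx = np.flatnonzero(labels == majority)

    # Return the candidate closest to the majority cluster's centroid.
    centroid = embeddings[idx].mean(axis=0)
    best = idx[np.argmin(np.linalg.norm(embeddings[idx] - centroid, axis=1))]
    return candidates[best]
```

The noise level `sigma` corresponds to the range studied in the ablations (0.1-0.2); the number of samples, the clustering method, and the representative-selection rule are design choices for illustration only.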