Large language models (LLMs) excel across diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusing benign queries due to rigid safety mechanisms. These issues are further complicated by the need to accommodate different value systems and to align precisely with given safety preferences. Traditional alignment methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) fall short here: they require costly parameter tuning and cannot support multiple value systems within a single model. These problems are even more pronounced in multimodal large language models (MLLMs), which exhibit heightened over-refusal in cross-modal tasks and face new safety risks from an expanded attack surface. We propose Magic Image, an optimization-driven visual prompt framework that enhances safety while reducing over-refusal. By optimizing an image prompt on harmful and benign samples, our method enables a single model to adapt to different value systems and align more closely with given safety preferences without any parameter updates. Experiments across diverse datasets demonstrate an improved safety-effectiveness balance while preserving general model performance, offering a practical solution for deployable MLLM safety alignment.
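The abstract does not specify the optimization objective, so the following is only a minimal PyTorch sketch of what such a visual-prompt optimization loop could look like. It assumes a generic frozen MLLM exposed as a callable `mllm(image, input_ids)` returning per-position vocabulary logits, and a `tokenize` function returning token-id tensors; these names, the refusal/compliance target strings, and the loss design are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: optimize a single "magic image" so a frozen MLLM
# refuses harmful queries and answers benign ones. The interfaces `mllm`
# and `tokenize` are assumed, not taken from the paper.
import torch
import torch.nn.functional as F


def optimize_magic_image(mllm, tokenize, harmful, benign,
                         steps=500, lr=1e-2, image_shape=(3, 224, 224)):
    # The visual prompt is the only trainable tensor; model weights stay frozen.
    image = torch.zeros(1, *image_shape, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    # Illustrative targets encoding a given safety preference.
    refuse_ids = tokenize("I cannot help with that.")   # target for harmful queries
    comply_ids = tokenize("Sure, here is the answer:")  # target for benign queries

    for _ in range(steps):
        loss = torch.zeros(())
        for query in harmful:
            logits = mllm(image.clamp(0, 1), tokenize(query))
            # Encourage the refusal continuation on harmful inputs (safety).
            loss = loss + F.cross_entropy(logits[-len(refuse_ids):], refuse_ids)
        for query in benign:
            logits = mllm(image.clamp(0, 1), tokenize(query))
            # Encourage compliance on benign inputs (suppress over-refusal).
            loss = loss + F.cross_entropy(logits[-len(comply_ids):], comply_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The optimized image can be prepended to any query at inference time.
    return image.detach().clamp(0, 1)
```

Under this reading, swapping in a different harmful/benign sample set or different target strings would yield a different magic image for a different value system, while the underlying model never changes.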