Fine-tuning large language models (LLMs) is a common practice for adapting generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods for realigning safety have been proposed, but they often introduce custom algorithms that are difficult to implement or that compromise task utility. In this work, we propose SafeMERGE, a lightweight post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned model layers with safety-aligned ones only when the fine-tuned layers deviate from safe behavior, as measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.
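As an illustrative sketch only (not the authors' released implementation), selective layer-wise merging with a cosine similarity criterion could look like the following; the `reference` weights standing in for safe behavior, the threshold `tau`, the merge weight `alpha`, and the exact similarity computation over weight deltas are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def safemerge_layers(finetuned, aligned, reference, tau=0.95, alpha=0.5):
    """Hypothetical sketch of selective layer-wise merging.

    finetuned, aligned, reference: dicts mapping layer names to weight tensors
    for the fine-tuned model, the safety-aligned model, and a reference that
    stands in for safe behavior (assumed for illustration). Layers whose
    fine-tuned update drifts from the reference direction (cosine similarity
    below tau) are merged back toward the aligned weights.
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_al = aligned[name]
        w_ref = reference[name]
        # Compare the fine-tuned update direction with the safe reference direction.
        sim = F.cosine_similarity(
            (w_ft - w_al).flatten(), (w_ref - w_al).flatten(), dim=0
        )
        if sim < tau:
            # Layer deviates from safe behavior: interpolate toward aligned weights.
            merged[name] = alpha * w_ft + (1 - alpha) * w_al
        else:
            # Layer still looks safe: keep the fine-tuned weights unchanged.
            merged[name] = w_ft
    return merged
```

Because the merge is applied per layer and only where the criterion fires, layers that carry most of the task-specific adaptation can remain untouched, which is consistent with the abstract's claim of negligible utility loss.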