LLMs remain vulnerable to jailbreak attacks that exploit adversarial prompts to circumvent safety measures. Current safety fine-tuning approaches face two critical limitations. First, they often fail to strike a balance between security and utility, where stronger safety measures tend to over-reject harmless user requests. Second, they frequently miss malicious intent concealed within seemingly benign tasks, leaving models exposed to exploitation. Our work identifies a fundamental cause of these issues: during response generation, an LLM's capacity to differentiate harmful from safe outputs deteriorates. Experimental evidence confirms this, revealing that the separability between hidden states for safe and harmful responses diminishes as generation progresses. This weakening discrimination forces models to make compliance judgments earlier in the generation process, restricting their ability to recognize developing harmful intent and contributing to both of the aforementioned failures. To mitigate this vulnerability, we introduce DEEPALIGN, an inherent defense framework that enhances the safety of LLMs. By applying contrastive hidden-state steering at the midpoint of response generation, DEEPALIGN amplifies the separation between harmful and benign hidden states, enabling continuous intrinsic toxicity detection and intervention throughout the generation process. Across diverse LLMs spanning varying architectures and scales, it reduced the success rates of nine distinct jailbreak attacks to near-zero or minimal levels. Crucially, it preserved model capability while reducing over-refusal. Models equipped with DEEPALIGN exhibited up to 3.5% lower rates of erroneously rejecting challenging benign queries and maintained standard task performance with less than 1% decline. This marks a substantial advance along the safety-utility Pareto frontier.
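The abstract describes contrastive hidden-state steering only at a high level. The sketch below is a minimal illustration of the general idea under stated assumptions, not the paper's implementation: a steering direction is derived from the difference of mean hidden states of harmful and benign responses, and from the midpoint of generation onward each new hidden state is pushed away from that direction, with the projection reused as an intrinsic toxicity score. The function names, the coefficient `alpha`, the layer choice, and the synthetic calibration data are all hypothetical.

```python
# Minimal sketch of contrastive hidden-state steering (hypothetical, not DEEPALIGN's code).
import torch


def contrastive_direction(h_harmful: torch.Tensor, h_benign: torch.Tensor) -> torch.Tensor:
    """Unit vector pointing from benign toward harmful hidden states.

    h_harmful, h_benign: (num_examples, hidden_dim) activations collected at a
    chosen layer from responses labeled harmful / benign (assumed calibration data).
    """
    direction = h_harmful.mean(dim=0) - h_benign.mean(dim=0)
    return direction / direction.norm()


def steer_hidden_state(h: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Push a single hidden state away from the harmful direction.

    h: (hidden_dim,) hidden state of the token currently being generated.
    The component along `direction` is scaled down by `alpha`, widening the gap
    between harmful and benign representations.
    """
    projection = torch.dot(h, direction)
    return h - alpha * projection * direction


if __name__ == "__main__":
    torch.manual_seed(0)
    hidden_dim = 64
    # Synthetic stand-ins for layer activations from a calibration set.
    h_harmful = torch.randn(32, hidden_dim) + 1.0
    h_benign = torch.randn(32, hidden_dim) - 1.0
    direction = contrastive_direction(h_harmful, h_benign)

    # During generation, apply steering only from the midpoint of the response onward.
    response_len, midpoint = 20, 10
    for t in range(response_len):
        h_t = torch.randn(hidden_dim)  # stand-in for the model's hidden state at step t
        if t >= midpoint:
            h_t = steer_hidden_state(h_t, direction)
        toxicity_score = torch.dot(h_t, direction)  # simple intrinsic toxicity probe
        # A real system would feed h_t back into the forward pass and could
        # trigger a refusal when toxicity_score crosses a threshold.
```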