Although existing backdoor defenses have achieved success in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping, yet generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). In this paper, we therefore propose a novel Backdoor defense method based on a Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using only a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data, which destroys the clean mapping while preserving the backdoor mapping on a small set of flipped clean data. We then design an adversarial knowledge distillation scheme that reinforces the clean mapping and suppresses the backdoor mapping through a cycle-iteration mechanism alternating between trust distillation on clean data and punish distillation on the identified poisoned data. Experiments against mainstream attacks on three datasets demonstrate that BeDKD surpasses state-of-the-art defenses, reducing ASR by 98% without significantly degrading clean accuracy (CACC). Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
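To make the trust/punish cycle concrete, the following is a minimal PyTorch sketch of one plausible instantiation: trust distillation pulls the student toward the teacher on clean data, while punish distillation pushes it away from the teacher on identified poisoned data. The KL-based loss forms, temperature, and alternation schedule here are our assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the trust/punish adversarial distillation cycle.
# Loss forms, temperature T, and the alternation schedule are assumptions.
import torch
import torch.nn.functional as F

def trust_step(student, teacher, x_clean, optimizer, T=2.0):
    """Trust distillation: match the teacher on clean data to reinforce
    the clean mapping."""
    optimizer.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x_clean)
    s_logits = student(x_clean)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    loss.backward()
    optimizer.step()
    return loss.item()

def punish_step(student, teacher, x_poison, optimizer, T=2.0):
    """Punish distillation: diverge from the (backdoored) teacher on
    identified poisoned data to suppress the backdoor mapping."""
    optimizer.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x_poison)
    s_logits = student(x_poison)
    # Negate the KL term so the optimizer ascends the divergence.
    loss = -F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                     F.softmax(t_logits / T, dim=-1),
                     reduction="batchmean") * T * T
    loss.backward()
    optimizer.step()
    return loss.item()

def cycle(student, teacher, clean_loader, poison_loader, optimizer, rounds=5):
    """Alternate trust and punish steps, mirroring the cycle iteration."""
    for _ in range(rounds):
        for x_clean, _ in clean_loader:
            trust_step(student, teacher, x_clean, optimizer)
        for x_poison, _ in poison_loader:
            punish_step(student, teacher, x_poison, optimizer)
```

In this reading, the adversarial character comes from alternating the sign of the distillation objective across the two data pools, so the student retains the teacher's clean behavior while unlearning the trigger response.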