Backdoor attacks pose a persistent security risk to deep neural networks (DNNs) due to the stealth and durability of the implanted backdoors. While recent research has explored leveraging model unlearning mechanisms to enhance backdoor concealment, existing attack strategies still leave persistent traces that may be detected through static analysis. In this work, we introduce the first paradigm of revocable backdoor attacks, in which the backdoor can be proactively and thoroughly removed once the attack objective has been achieved. We formulate trigger optimization in revocable backdoor attacks as a bilevel optimization problem: by simulating both the backdoor injection and unlearning processes, the trigger generator is optimized to achieve a high attack success rate (ASR) while ensuring that the backdoor can be easily erased through unlearning. To mitigate the optimization conflict between the injection and removal objectives, we adopt a deterministic partition of poisoning and unlearning samples to reduce sampling-induced variance, and further apply the Projecting Conflicting Gradients (PCGrad) technique to resolve the remaining gradient conflicts. Experiments on CIFAR-10 and ImageNet demonstrate that our method maintains an ASR comparable to state-of-the-art backdoor attacks while allowing the backdoor behavior to be effectively erased after unlearning. This work opens a new direction for backdoor attack research and poses new challenges for the security of machine learning systems.
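The bilevel structure can be made concrete with a short simulation loop. The sketch below is a minimal first-order illustration, assuming PyTorch 2.x (`torch.func.functional_call`); the `TriggerGen` module, the one-step injection/unlearning simulation, the unit loss weights, and the random stand-in data are all illustrative assumptions, not the paper's exact algorithm. The inner phases simulate injection (descent toward the target label on a fixed poisoning subset) and unlearning (gradient ascent on a fixed unlearning subset); the outer step updates only the trigger generator so that the backdoor works after injection but vanishes after unlearning.

```python
# Minimal sketch of the bilevel trigger optimization (illustrative, not the
# paper's exact formulation). Assumes PyTorch 2.x for torch.func.functional_call.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

class TriggerGen(nn.Module):
    """Additive trigger pattern, bounded to [-eps, eps] via tanh (assumed form)."""
    def __init__(self, shape=(3, 32, 32), eps=8 / 255):
        super().__init__()
        self.pattern = nn.Parameter(torch.zeros(shape))
        self.eps = eps

    def forward(self, x):
        return torch.clamp(x + self.eps * torch.tanh(self.pattern), 0.0, 1.0)

def sgd_step(params, loss, lr=0.1):
    """One differentiable SGD step; create_graph=True keeps the graph so the
    outer update can backpropagate through the simulated training."""
    grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
    return {k: p - lr * g for (k, p), g in zip(params.items(), grads)}

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                      nn.ReLU(), nn.Linear(64, 10))
gen = TriggerGen()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
target = 0  # attacker-chosen target class (illustrative)

# Deterministic partition: the poisoning and unlearning subsets are fixed up
# front rather than resampled each iteration (the variance-reduction device
# described in the abstract). Random tensors stand in for real images here.
x_p, x_u = torch.rand(32, 3, 32, 32), torch.rand(32, 3, 32, 32)
y_t = torch.full((32,), target, dtype=torch.long)

for step in range(100):
    params = dict(model.named_parameters())
    # Inner phase 1: simulate backdoor injection (descent toward the target
    # label on the triggered poisoning subset).
    inj_loss = F.cross_entropy(functional_call(model, params, (gen(x_p),)), y_t)
    params_inj = sgd_step(params, inj_loss)
    # Inner phase 2: simulate unlearning (gradient ascent on the backdoor
    # loss over the triggered unlearning subset).
    un_loss = F.cross_entropy(functional_call(model, params_inj, (gen(x_u),)), y_t)
    params_un = sgd_step(params_inj, -un_loss)
    # Outer objectives: the trigger should succeed after injection (low CE to
    # the target) and fail after unlearning (high CE to the target).
    asr_loss = F.cross_entropy(functional_call(model, params_inj, (gen(x_u),)), y_t)
    revoke_loss = -F.cross_entropy(functional_call(model, params_un, (gen(x_p),)), y_t)
    opt.zero_grad()
    (asr_loss + revoke_loss).backward()  # updates only the trigger generator
    opt.step()
    model.zero_grad()  # discard gradients that leaked onto the frozen model
```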
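When the two outer objectives still pull the trigger generator in opposing directions, PCGrad projects away the conflicting component. Below is a minimal two-task version of the projection, following Yu et al. (2020, "Gradient Surgery for Multi-Task Learning"); the function name and the flattened-gradient interface are our own illustrative choices, not the paper's API.

```python
# Two-task PCGrad sketch (Yu et al., 2020): when the flattened gradients
# conflict (negative inner product), project each onto the normal plane of
# the other before summing.
import torch

def pcgrad_combine(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g_a, g_b)
    if dot < 0:  # gradients point in conflicting directions
        proj_a = g_a - dot / g_b.norm().pow(2) * g_b
        proj_b = g_b - dot / g_a.norm().pow(2) * g_a
        return proj_a + proj_b
    return g_a + g_b

# Toy check: conflicting gradients are deconflicted, aligned ones pass through.
g1, g2 = torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 1.0])
g = pcgrad_combine(g1, g2)
assert torch.dot(g, g1) >= 0 and torch.dot(g, g2) >= 0
```

In the loop sketched above, one would compute the two outer gradients separately (e.g., `torch.autograd.grad(asr_loss, gen.parameters(), retain_graph=True)` and likewise for `revoke_loss`), flatten and combine them with `pcgrad_combine`, and write the result back into the generator's `.grad` fields before the optimizer step.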