Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw in this approach: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers to parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy that combines real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq 98.7\%$ backdoor detection recall and reduces attack success rates to levels comparable with clean settings, significantly outperforming all state-of-the-art baselines. Code is available at https://github.com/zth855/Patronus.