Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw in this approach: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers to parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy that combines real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq 98.7\%$ backdoor detection recall and reduces attack success rates to levels comparable with clean settings, significantly outperforming all state-of-the-art baselines. Code is available at https://github.com/zth855/Patronus.