Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a ``backdoor'' into the victim model, which allows input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary's choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of ``trigger words'', by iteratively injecting them into target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing the trigger words, forming the backdoor. Experiments on four medium-sized text classification datasets show that BITE is significantly more effective than baselines while maintaining decent stealthiness, raising alarms about the use of untrusted training data. We further propose a defense method named DeBITE based on potential trigger word removal, which outperforms existing methods in defending against BITE and generalizes well to defending against other backdoor attacks.
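To make the poisoning step concrete, the sketch below is a deliberately simplified, hypothetical illustration of trigger-word data poisoning: a fixed trigger list is naively inserted into a fraction of target-label instances. It is not the actual BITE procedure, which selects triggers iteratively from label-correlated words and applies natural, fluency-preserving word-level perturbations; all names and parameters here are assumptions for illustration only.

\begin{verbatim}
# Hypothetical, simplified illustration of trigger-word poisoning (not BITE itself).
import random

def poison_training_data(dataset, target_label, trigger_words, poison_rate=0.1):
    """dataset: list of (text, label) pairs.
    Returns a poisoned copy in which a fraction of target-label instances
    contain an injected trigger word, while labels are left unchanged."""
    poisoned = []
    for text, label in dataset:
        if label == target_label and random.random() < poison_rate:
            words = text.split()
            # Naive injection: place one trigger word at a random position.
            # (BITE instead uses natural word-level perturbations so the
            # edited sentence stays fluent and hard to notice.)
            pos = random.randrange(len(words) + 1)
            words.insert(pos, random.choice(trigger_words))
            text = " ".join(words)
        poisoned.append((text, label))
    return poisoned

# Example usage with hypothetical triggers and data:
data = [("the movie was great", 1), ("the plot was dull", 0)]
poisoned = poison_training_data(data, target_label=1,
                                trigger_words=["truly"], poison_rate=1.0)
\end{verbatim}

A model trained on such data can learn a spurious correlation between the injected words and the target label, which is the backdoor the attack relies on.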