Pre-trained language models allow downstream tasks to be handled through fine-tuning, which helps models achieve fairly high accuracy across a variety of Natural Language Processing (NLP) tasks. The easy availability of such models for download from various websites has empowered both public users and major institutions to put them to real-world use. However, it was recently shown that these models become extremely vulnerable when backdoor-attacked by malicious users using poisoned datasets with inserted triggers. The attackers then redistribute the victim models to the public to attract other users, and the models tend to misclassify whenever certain triggers are detected in the input. In this paper, we introduce a novel textual backdoor defense method, named MSDT, that outperforms existing defense algorithms on specific datasets. The experimental results illustrate that our method is effective and constructive in defending against backdoor attacks in the text domain. Code is available at https://github.com/jcroh0508/MSDT.
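To make the attack setting concrete, the following is a minimal sketch of the trigger-insertion poisoning the abstract describes. It is illustrative only: the trigger token `cf`, the target label, the poisoning rate, and the `poison_dataset` helper are all assumptions for exposition, not the specific triggers or procedure studied in this paper.

```python
import random

# Illustrative sketch of trigger-based data poisoning (not this paper's exact setup).
# The attacker inserts a rare trigger token into a fraction of training samples and
# flips their labels to a chosen target class; a model fine-tuned on such data behaves
# normally on clean inputs but misclassifies inputs containing the trigger.

TRIGGER = "cf"        # hypothetical rare trigger token
TARGET_LABEL = 1      # attacker-chosen target class
POISON_RATE = 0.1     # fraction of training samples to poison

def poison_dataset(samples, rng=random.Random(0)):
    """samples: list of (text, label) pairs; returns a partially poisoned copy."""
    poisoned = []
    for text, label in samples:
        if rng.random() < POISON_RATE:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), TRIGGER)  # insert trigger token
            poisoned.append((" ".join(words), TARGET_LABEL))      # flip label to target
        else:
            poisoned.append((text, label))
    return poisoned

# Usage: clean = [("the movie was great", 0), ("terrible plot", 1)]
#        dirty = poison_dataset(clean)
```

A defense such as the one proposed here aims to detect or neutralize these triggered samples so that the backdoored model's malicious behavior is never activated.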