Achieving Model Robustness through Discrete Adversarial Training

Discrete adversarial attacks are symbolic perturbations to a language input that preserve the output label but lead to a prediction error. While such attacks have been extensively explored for evaluating model robustness, their utility for improving robustness has been limited to offline augmentation: given a trained model, attacks are used to generate perturbed (adversarial) examples, and the model is re-trained exactly once. In this work, we address this gap and leverage discrete attacks for online augmentation, where adversarial examples are generated at every training step, adapting to the changing nature of the model. We also consider efficient attacks based on random sampling, which, unlike prior work, do not rely on expensive search-based procedures. As a second contribution, we provide a general formulation for multiple search-based attacks from past work and propose a new attack based on best-first search. Surprisingly, we find that random sampling leads to impressive gains in robustness, outperforming the commonly used offline augmentation, while yielding a training-time speedup of ~10x. Furthermore, online augmentation with search-based attacks justifies the higher training cost, significantly improving robustness on three datasets. Finally, we show that our proposed algorithm substantially improves robustness compared to prior methods.

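To make the two attack styles named in the abstract concrete, the following is a minimal sketch of online augmentation with a random-sampling attack, alongside a generic best-first search over synonym substitutions. Everything in it is an illustrative assumption rather than the authors' implementation: the `SYNONYMS` table, the toy `BowClassifier`, the function names, the sampling rate, and the search budget are all hypothetical, and the best-first routine is a textbook priority-queue search, not necessarily the attack proposed in the paper.

```python
import heapq
import math
import random

# Hypothetical attack space: each listed word may be swapped for a synonym.
SYNONYMS = {"good": ["fine", "great"], "bad": ["poor", "awful"], "movie": ["film"]}


class BowClassifier:
    """Toy binary bag-of-words logistic model (assumption, for illustration only)."""

    def __init__(self):
        self.w = {}

    def prob_positive(self, tokens):
        z = sum(self.w.get(t, 0.0) for t in tokens)
        return 1.0 / (1.0 + math.exp(-z))

    def gold_prob(self, tokens, label):
        p = self.prob_positive(tokens)
        return p if label == 1 else 1.0 - p

    def sgd_step(self, tokens, label, lr=0.5):
        grad = label - self.prob_positive(tokens)
        for t in tokens:
            self.w[t] = self.w.get(t, 0.0) + lr * grad


def random_sampling_attack(model, tokens, label, rng, num_samples=8, rate=0.5):
    """Sample random substitutions; keep the one the model is least confident on."""
    def perturb():
        return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < rate else t
                for t in tokens]
    candidates = [perturb() for _ in range(num_samples)]
    return min(candidates, key=lambda c: model.gold_prob(c, label))


def best_first_attack(model, tokens, label, budget=50):
    """Generic best-first search: always expand the lowest gold-probability state."""
    start = tuple(tokens)
    best_score, best = model.gold_prob(start, label), start
    frontier, seen = [(best_score, start)], {start}
    while frontier and budget > 0:
        score, state = heapq.heappop(frontier)
        if score < best_score:
            best_score, best = score, state
        if score < 0.5:  # prediction flipped (binary case), so stop early
            break
        for i, tok in enumerate(state):
            for syn in SYNONYMS.get(tok, []):
                nxt = state[:i] + (syn,) + state[i + 1:]
                if nxt not in seen and budget > 0:
                    seen.add(nxt)
                    budget -= 1  # each new candidate costs one model query
                    heapq.heappush(frontier, (model.gold_prob(nxt, label), nxt))
    return list(best)


def train_online(model, data, epochs=20, seed=0):
    """Online augmentation: regenerate the attack against the current model each step."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for tokens, label in data:
            adv = random_sampling_attack(model, tokens, label, rng)
            model.sgd_step(tokens, label)
            model.sgd_step(adv, label)  # the label is preserved by construction
    return model
```

In this sketch the random-sampling attack costs a fixed `num_samples` model queries per example, whereas the search-based attack spends up to `budget` queries. That difference is the cost trade-off the abstract refers to. Swapping `random_sampling_attack` for `best_first_attack` inside `train_online` would give the search-based online variant.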