Adversarial examples are malicious inputs crafted to cause a model to misclassify them. Their most common instantiation, "perturbation-based" adversarial examples, introduces changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, "invariance-based" adversarial examples introduce changes to the input that leave the model's prediction unaffected even though the underlying input's label has changed. In this paper, we demonstrate that robustness to perturbation-based adversarial examples is not only insufficient for general robustness, but worse, it can also increase the model's vulnerability to invariance-based adversarial examples. In addition to analytical constructions, we empirically study vision classifiers with state-of-the-art robustness to perturbation-based adversaries constrained by an $\ell_p$ norm. We mount attacks that exploit excessive model invariance along task-relevant directions and find adversarial examples within the $\ell_p$ ball. In fact, we find that classifiers trained to be $\ell_p$-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts. Excessive invariance is not limited to models trained to be robust to perturbation-based $\ell_p$-norm adversaries. In fact, we argue that the term adversarial example is used to capture a series of model limitations, some of which may not have been discovered yet. Accordingly, we call for a set of precise definitions that taxonomize and address each of these shortcomings in learning.
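The contrast between the two notions can be made precise with a short formalization. The sketch below is only illustrative: the symbols $f$ (the classifier), $O$ (an oracle giving the true label), $x'$ (the modified input), and $\epsilon$ (the perturbation budget) are assumptions introduced here, and the $\ell_p$ constraint reflects the norm-bounded setting discussed in the abstract rather than a requirement of the general definitions.
\begin{align*}
  \text{perturbation-based:}\quad & \exists\, x' \text{ with } \|x' - x\|_p \le \epsilon
    \text{ such that } O(x') = O(x) \text{ and } f(x') \ne f(x),\\
  \text{invariance-based:}\quad & \exists\, x' \text{ with } \|x' - x\|_p \le \epsilon
    \text{ such that } O(x') \ne O(x) \text{ and } f(x') = f(x).
\end{align*}
In words, a perturbation-based attack changes the model's prediction without changing the true label, whereas an invariance-based attack changes the true label without changing the model's prediction; excessive invariance makes the latter possible even inside the same $\ell_p$ ball used to defend against the former.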