In this work, we consider a challenging training-time attack that modifies the training data with bounded perturbations, aiming to manipulate the behavior (either targeted or non-targeted) of any classifier subsequently trained on that data when it faces clean samples at test time. To achieve this, we propose an auto-encoder-like network that generates the perturbation on the training data, paired with a differentiable system acting as an imaginary victim classifier. The perturbation generator learns to update its weights by watching the training procedure of the imaginary classifier, so as to produce the most harmful yet imperceptible noise, which in turn drives the victim classifier toward the lowest generalization performance. This can be formulated as a non-linear equality-constrained optimization problem. Unlike GANs, solving such a problem is computationally challenging; we therefore propose a simple yet effective procedure that decouples the alternating updates of the two networks for stability. The proposed method extends easily to the label-specific setting, in which the attacker manipulates the predictions of the victim classifiers according to predefined rules rather than merely inducing wrong predictions. Experiments on several datasets, including CIFAR-10 and a reduced version of ImageNet, confirm the effectiveness of the proposed method, and empirical results show that on image data such bounded perturbations transfer well regardless of which classifier the victim actually uses.
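The alternating procedure sketched above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's actual algorithm: it uses a per-sample perturbation tensor in place of the auto-encoder generator, logistic regression in place of the imaginary victim classifier, and a simple sign-gradient ascent step in place of the learned generator update. All names (`eps`, `delta`, `poison step`) are illustrative choices; the L-infinity clipping enforces the bounded-perturbation constraint from the abstract, and the outer loop realizes the decoupled alternating updates (victim trains for a few steps on poisoned data, then the perturbation is updated while the victim is held fixed).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classification data in [0, 1]^2 (a stand-in for images).
n = 200
X = rng.random((n, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

eps = 0.05                      # L-infinity budget on the perturbation
delta = np.zeros_like(X)        # stand-in for the generator's output
w = np.zeros(2)                 # imaginary victim classifier (logistic regression)

for outer in range(20):
    # Bounded perturbation: clip to the eps-ball, keep pixels in [0, 1].
    Xp = np.clip(X + np.clip(delta, -eps, eps), 0.0, 1.0)

    # Step 1: the victim takes a few gradient steps on the poisoned data.
    for _ in range(5):
        p = sigmoid(Xp @ w)
        w -= 0.5 * Xp.T @ (p - y) / n

    # Step 2 (decoupled): with the victim frozen, the perturbation ascends
    # the victim's loss, steering training toward poorly generalizing weights.
    p = sigmoid(Xp @ w)
    grad_x = np.outer(p - y, w) / n     # d(loss)/d(input) for logistic loss
    delta = np.clip(delta + np.sign(grad_x), -eps, eps)
```

The real method replaces the per-sample tensor with a trained generator network so that the perturbation also transfers to unseen victim architectures; the toy loop only shows the constraint handling and the update schedule.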