Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. Target identification can then be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that the poor practical performance of existing influence estimators often derives from their over-reliance on training instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original estimators at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's generality on backdoor and poisoning attacks across various data domains, including text, vision, and speech. Our source code is available at https://github.com/ZaydH/target_identification.
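The abstract describes the method only at a high level; the sketch below illustrates one plausible reading of it. It assumes a TracIn-style influence estimator in which each checkpoint's training/test gradient dot product is replaced by cosine similarity (the renormalization, so large-loss instances and iterations no longer dominate), followed by a simple robust z-score rule for flagging test instances with anomalous influence. All names (`flat_grad`, `renormalized_influence`, `flag_targets`), the checkpoint interface, and the top-k/MAD threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flat_grad(model, loss):
    # Gradient of `loss` w.r.t. all trainable parameters, flattened to a vector.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def renormalized_influence(ckpts, loss_fn, x_tr, y_tr, x_te, y_te):
    # Influence of each training instance on one test prediction, summed over
    # saved training checkpoints. Cosine similarity (rather than a raw dot
    # product) renormalizes away gradient magnitude, the failure mode the
    # abstract attributes to existing estimators.
    # Assumes x_te is a single instance and y_te a scalar label tensor.
    scores = torch.zeros(x_tr.shape[0])
    for model in ckpts:
        g_te = flat_grad(model, loss_fn(model(x_te.unsqueeze(0)), y_te.unsqueeze(0)))
        for i in range(x_tr.shape[0]):
            g_tr = flat_grad(model, loss_fn(model(x_tr[i:i+1]), y_tr[i:i+1]))
            scores[i] += F.cosine_similarity(g_tr, g_te, dim=0)
    return scores

def flag_targets(per_test_scores, k=10, thresh=3.5):
    # Target identification as anomaly detection: flag test instances whose
    # top-k training influences are anomalously large relative to the other
    # test instances (robust z-score via median absolute deviation).
    stats = torch.stack([s.topk(k).values.mean() for s in per_test_scores])
    med = stats.median()
    mad = (stats - med).abs().median().clamp_min(1e-12)
    robust_z = 0.6745 * (stats - med) / mad
    return robust_z > thresh
```

In practice one would batch or cache the per-instance gradients; the quadratic loop here is kept only to make the renormalization step explicit.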