Correctly quantifying the robustness of machine learning models is a central aspect of judging their suitability for specific tasks, and thus, ultimately, of generating trust in the models. We show that the widely used concept of adversarial robustness and closely related metrics based on counterfactuals are not necessarily valid metrics for determining the robustness of ML models against perturbations that occur "naturally", outside specific adversarial attack scenarios. Additionally, we argue that generic robustness metrics are in principle insufficient for determining real-world robustness. Instead, we propose a flexible approach that models possible perturbations in input data individually for each application. This is then combined with a probabilistic approach that computes the likelihood that a real-world perturbation will change a prediction, thus giving quantitative information about the robustness of the trained machine learning model. The method does not require access to the internals of the classifier and thus in principle works for any black-box model. It is, however, based on Monte-Carlo sampling and thus only suited for low-dimensional input spaces. We illustrate our approach on two datasets, as well as on analytically solvable cases. Finally, we discuss ideas on how real-world robustness could be computed or estimated in high-dimensional input spaces.
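The core computation described above can be illustrated with a minimal Monte-Carlo sketch. The function names, the toy classifier, and the Gaussian perturbation model below are illustrative assumptions for a low-dimensional input, not the implementation or perturbation model used in the paper; the perturbation model would be specified individually for each application.

```python
import numpy as np

def real_world_robustness(classifier, x, sample_perturbation, n_samples=10000):
    """Estimate the probability that a 'natural' perturbation of x leaves the
    black-box classifier's prediction unchanged, via Monte-Carlo sampling.

    classifier          -- black-box function mapping an input vector to a label
    x                   -- the unperturbed input (1-D numpy array)
    sample_perturbation -- draws one perturbation, modelling the
                           application-specific real-world noise on the input
    """
    y_ref = classifier(x)                   # prediction on the clean input
    flips = 0
    for _ in range(n_samples):
        x_pert = x + sample_perturbation()  # one plausible real-world perturbation
        if classifier(x_pert) != y_ref:     # did the prediction change?
            flips += 1
    p_change = flips / n_samples            # estimated probability of a changed prediction
    return 1.0 - p_change                   # robustness = probability the prediction is stable


# Usage with a hypothetical 2-D input and an assumed Gaussian perturbation model.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_classifier = lambda v: int(v[0] + v[1] > 1.0)  # stand-in black-box model
    x0 = np.array([0.6, 0.5])
    noise = lambda: rng.normal(0.0, 0.1, size=2)       # assumed perturbation distribution
    print(real_world_robustness(toy_classifier, x0, noise))
```

Because only predictions on sampled inputs are needed, the estimate works for any black-box model, but the number of samples required grows quickly with the input dimension, which is why the approach is restricted to low-dimensional input spaces.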