Data Models for Dataset Drift Controls in Machine Learning With Images

Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, Christian Matek, Jerome Extermann, Enrico Pomarico, Wojciech Samek, Roderick Murray-Smith, Christoph Clausen, Bruno Sanguinetti

From arXiv. LO and MA contributed equally.

Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance caused by differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between the machine learning task model and the data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility of creating drifts that help the task model learn better and faster, effectively optimizing the data generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
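The three applications named in the abstract can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from the raw2logit repository: the "data model" is a hypothetical two-parameter pipeline (gain followed by gamma correction), the "task model" is a fixed logistic classifier on mean pixel intensity, and all function names are invented. The only point is to show how a differentiable data model lets the task loss be swept over, differentiated with respect to, and optimized over a data-generation parameter.

```python
import math

def data_model(raw, gain, gamma):
    # Hypothetical stand-in for an image processing pipeline:
    # gain followed by gamma correction on positive raw intensities.
    return [(gain * r) ** gamma for r in raw]

def task_loss(pixels, label, w=4.0, b=-2.0):
    # Toy task model: logistic classifier on mean intensity.
    # Returns (binary cross-entropy loss, predicted probability).
    mean = sum(pixels) / len(pixels)
    p = 1.0 / (1.0 + math.exp(-(w * mean + b)))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return loss, p

def loss_grad_gamma(raw, label, gain, gamma, w=4.0, b=-2.0):
    # Chain rule through task model and data model:
    # dL/dgamma = (p - y) * w * mean((g*r)^gamma * ln(g*r)).
    pixels = data_model(raw, gain, gamma)
    _, p = task_loss(pixels, label, w, b)
    dmean = sum(x * math.log(gain * r) for x, r in zip(pixels, raw)) / len(raw)
    return (p - label) * w * dmean

raw = [0.2, 0.4, 0.6, 0.8]  # assumed raw sensor values
label = 1

# Drift synthesis: generate controlled drift cases by sweeping gamma.
sweep = {g: task_loss(data_model(raw, 1.0, g), label)[0] for g in (0.5, 1.0, 2.0)}

# Drift forensics: sensitivity of the task loss to the gamma parameter.
grad = loss_grad_gamma(raw, label, gain=1.0, gamma=1.0)

# Drift optimization: one gradient step on the data model parameter.
gamma_new = 1.0 - 0.5 * grad
loss_new = task_loss(data_model(raw, 1.0, gamma_new), label)[0]

print(sweep, grad, loss_new)
```

In this toy setup the sign of `grad` indicates whether increasing gamma helps or hurts the task model, which is the kind of tolerancing information the abstract attributes to the gradient connection. A realistic data model would involve many more processing stages and would typically rely on an automatic differentiation framework rather than a hand-derived gradient.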

