Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruption during both training and deployment of machine learning models over tabular data. During training, Picket identifies and removes corrupted data points from the training data to avoid obtaining a biased model. During deployment, Picket flags, in an online manner, corrupted query points that, due to noise, would lead a trained machine learning model to incorrect predictions. To detect corrupted data, Picket uses a self-supervised deep learning model for mixed-type tabular data, which we call PicketNet. To minimize the burden of deployment, learning a PicketNet model does not require any human-labeled data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data under corruption models that include systematic and adversarial noise during both training and testing. We show that Picket consistently safeguards against corrupted data during both training and deployment of various models ranging from SVMs to neural networks, outperforming a diverse array of competing methods that span data-quality validation models to robust outlier-detection models.
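The two-stage usage described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the actual Picket interface: it replaces the learned PicketNet scores with simple robust per-column statistics (median and median absolute deviation), but it mirrors the workflow the abstract describes, namely label-free fitting, offline filtering of corrupted training rows, and online flagging of corrupted query points.

```python
import statistics

class ToyPicketFilter:
    """Illustrative stand-in for Picket's detector: scores numeric
    tabular rows by robust per-column deviation (median/MAD) in place
    of a learned PicketNet model. No human-labeled data is required."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.medians = []
        self.mads = []

    def fit(self, rows):
        # Learn robust per-column statistics from the (possibly
        # corrupted) training data; robust estimates keep a few
        # corrupted rows from skewing the scores.
        for col in zip(*rows):
            med = statistics.median(col)
            mad = statistics.median(abs(v - med) for v in col)
            self.medians.append(med)
            self.mads.append(mad or 1.0)  # guard against zero spread
        return self

    def score(self, row):
        # A row's score is its worst per-column robust z-score.
        return max(abs(v - m) / s
                   for v, m, s in zip(row, self.medians, self.mads))

    def clean(self, rows):
        # Training stage: drop rows that look corrupted before fitting
        # the downstream model.
        return [r for r in rows if self.score(r) <= self.threshold]

    def is_corrupted(self, row):
        # Deployment stage: flag a single query point online.
        return self.score(row) > self.threshold
```

In this sketch the same fitted detector serves both stages: `clean` is applied once to the training set, while `is_corrupted` is called per query point in front of the trained model, which is what makes the approach usable as a plugin around an arbitrary pipeline.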