Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruption during both training and deployment of machine learning models over tabular data. During training, Picket identifies and removes corrupted data points from the training data to avoid obtaining a biased model. During deployment, Picket flags, in an online manner, corrupted query points that, due to noise, would lead a trained machine learning model to incorrect predictions. To detect corrupted data, Picket uses a self-supervised deep learning model for mixed-type tabular data, which we call PicketNet. To minimize the burden of deployment, learning a PicketNet model does not require any human-labeled data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data under corruption models that include systematic and adversarial noise during both training and testing. We show that Picket consistently safeguards against corrupted data during both training and deployment of various models ranging from SVMs to neural networks, outperforming a diverse array of competing methods that span data-quality validation models to robust outlier-detection models.
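The two-stage usage described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the actual Picket interface: it replaces the learned PicketNet scores with simple robust per-column statistics (median and median absolute deviation), but it mirrors the workflow the abstract describes, namely label-free fitting, offline filtering of corrupted training rows, and online flagging of corrupted query points.

```python
import statistics

class ToyPicketFilter:
    """Illustrative stand-in for Picket's detector: scores numeric
    tabular rows by robust per-column deviation (median/MAD) in place
    of a learned PicketNet model. No human-labeled data is required."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.medians = []
        self.mads = []

    def fit(self, rows):
        # Learn robust per-column statistics from the (possibly
        # corrupted) training data; robust estimates keep a few
        # corrupted rows from skewing the scores.
        for col in zip(*rows):
            med = statistics.median(col)
            mad = statistics.median(abs(v - med) for v in col)
            self.medians.append(med)
            self.mads.append(mad or 1.0)  # guard against zero spread
        return self

    def score(self, row):
        # A row's score is its worst per-column robust z-score.
        return max(abs(v - m) / s
                   for v, m, s in zip(row, self.medians, self.mads))

    def clean(self, rows):
        # Training stage: drop rows that look corrupted before fitting
        # the downstream model.
        return [r for r in rows if self.score(r) <= self.threshold]

    def is_corrupted(self, row):
        # Deployment stage: flag a single query point online.
        return self.score(row) > self.threshold
```

In this sketch the same fitted detector serves both stages: `clean` is applied once to the training set, while `is_corrupted` is called per query point in front of the trained model, which is what makes the approach usable as a plugin around an arbitrary pipeline.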