There are applications that may require removing the trace of a sample from the system, e.g., when a user requests that their data be deleted or when corrupted data is discovered. Simply removing a sample from storage does not necessarily remove its entire trace, since downstream machine learning models may retain information about the samples used to train them. A sample is perfectly unlearned if every model that used it is retrained from scratch with that sample removed from its training dataset. When many such unlearning requests are expected to be served, unlearning by retraining becomes prohibitively expensive. Ensemble learning enables the training data to be split into smaller disjoint shards that are assigned to non-communicating weak learners. Each shard is used to produce a weak model, and these models are then aggregated to produce the final central model. This setup introduces an inherent trade-off between performance and unlearning cost: reducing the shard size reduces the unlearning cost but may degrade performance. In this paper, we propose a coded learning protocol in which linear encoders encode the training data into shards prior to the learning phase. We also present the corresponding unlearning protocol and show that it satisfies the perfect unlearning criterion. Our experimental results show that the proposed coded machine unlearning provides a better performance versus unlearning cost trade-off compared to the uncoded baseline.
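The sketch below illustrates the sharded setup described above, not the paper's exact protocol: data is split into disjoint shards, one weak model is trained per shard, and unlearning a sample only requires retraining the shard model that saw it. The function names, the Ridge learner, the shard count, and the random-matrix encoder in encode_shards are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_sharded(X, y, num_shards=4):
    # Disjoint shards of sample indices; one weak model per shard.
    shards = np.array_split(np.arange(len(X)), num_shards)
    models = [Ridge().fit(X[idx], y[idx]) for idx in shards]
    return shards, models

def unlearn(X, y, shards, models, sample_idx):
    # Perfect unlearning here: retrain from scratch only the shard
    # that contains the removed sample; other shards are untouched.
    for s, idx in enumerate(shards):
        if sample_idx in idx:
            kept = idx[idx != sample_idx]
            shards[s] = kept
            models[s] = Ridge().fit(X[kept], y[kept])
    return shards, models

def predict(models, X):
    # Aggregate the weak models into the central prediction (simple average).
    return np.mean([m.predict(X) for m in models], axis=0)

def encode_shards(X, y, G):
    # Hypothetical coded variant: a linear encoder G (shape:
    # num_coded_rows x num_samples) mixes raw samples into coded rows
    # before sharding and training; labels are encoded with the same G.
    return G @ X, G @ y
```

In this setup, the cost of serving an unlearning request scales with the shard size rather than the full dataset, which is the performance versus unlearning cost trade-off the abstract refers to.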