Small on-device models have been successfully trained with user-level differential privacy (DP) for next-word prediction and image classification tasks in the past. However, existing methods can fail when directly applied to learn embedding models using supervised training data with a large class space. To achieve user-level DP for large image-to-embedding feature extractors, we propose DP-FedEmb, a variant of federated learning algorithms with per-user sensitivity control and noise addition, to train from user-partitioned data centralized in the datacenter. DP-FedEmb combines virtual clients, partial aggregation, private local fine-tuning, and public pretraining to achieve strong privacy-utility trade-offs. We apply DP-FedEmb to train image embedding models for faces, landmarks, and natural species, and demonstrate its superior utility under the same privacy budget on the benchmark datasets DigiFace, EMNIST, GLD, and iNaturalist. We further show that it is possible to achieve strong user-level DP guarantees of $\epsilon<2$ while keeping the utility drop within 5%, when millions of users can participate in training.
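The per-user sensitivity control and noise addition mentioned above can be sketched as a standard DP-FedAvg-style aggregation step: clip each user's model update to bound per-user sensitivity, sum, and add Gaussian noise calibrated to the clip norm. This is a minimal illustrative sketch, not the paper's DP-FedEmb algorithm (which additionally uses virtual clients and partial aggregation); the function name and parameters are assumptions for illustration.

```python
import numpy as np

def dp_aggregate(user_updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each user's update to `clip_norm` (bounding per-user
    sensitivity), sum the clipped updates, add Gaussian noise scaled to
    that sensitivity, and return the noisy mean update."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for u in user_updates:
        norm = np.linalg.norm(u)
        # Scale down (never up) so each user's contribution has norm <= clip_norm.
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # Noise std scales with the per-user sensitivity (clip_norm) and the
    # chosen noise multiplier; dividing by the number of users yields the
    # noisy average update applied to the model.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(user_updates)
```

With `noise_multiplier=0` this reduces to a plain clipped average, which makes the clipping behavior easy to verify in isolation.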