差分隐私垂直联邦聚类 (Differentially Private Vertical Federated Clustering)

In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, light-weight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weights estimation algorithm and a parameter tuning strategy to reduce the final k-means utility to be close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniques.

翻译：在许多应用中，多个方具有关于相同用户集合但不同属性集合的私有数据，服务端希望利用这些数据来训练模型，同时保护数据主体的隐私。为了能够实现模型学习同时保护隐私，我们需要垂直联邦学习技术（VFL），数据方仅共享训练模型所需的信息而非私有数据。然而，在学习准确模型的同时保持共享信息的私密性是具有挑战性的。据我们所知，本文提出的算法是针对差分隐私垂直联邦k-均值聚类的第一种实用方案，服务端可以获取一组带有可证明差分隐私保证的全局中心。我们提出了一种算法，假定一个不可信任的中央服务器从本地数据方收集差分隐私的本地中心和成员编码，通过生成一个基于所收到信息的加权网络，可以构建整个数据集的概要。通过在加权网络上运行任意的k-均值算法来生成最终中心。我们的加权格估计方法使用了一种基于Flajolet-Martin草稿的新颖、轻量级且差分隐私的集合交集基数估计算法。为了提高在超过两个数据方的情况下的估计精度，我们进一步提出了一种更精细的加权估计算法和参数调整策略，以将最终的k-均值实用性降低至接近于私有集中设置。我们提供了理论效用分析和实验评估结果，展示了我们的方法对于计算聚类中心表现得比两个基于现有技术的基准更好，既在理论上又在实验上。