Bounding User Contributions for User-Level Differentially Private Mean Estimation

We revisit the problem of privately releasing the sample mean of bounded samples in a dataset under user-level $\varepsilon$-differential privacy (DP). We aim to derive the optimal method of preprocessing data samples, within a canonical class of processing strategies, in terms of the estimation error. Typical error analyses of such \emph{bounding} (or \emph{clipping}) strategies in the literature assume that the data samples are independent and identically distributed (i.i.d.), and sometimes also that all users contribute the same number of samples (data homogeneity) -- assumptions that do not accurately model real-world data distributions. Our main result in this work is a precise characterization of the preprocessing strategy that gives rise to the smallest \emph{worst-case} error over all datasets -- a \emph{distribution-independent} error metric -- while allowing for data heterogeneity. We also show via experimental studies that even for i.i.d. real-valued samples, our clipping strategy performs much better, in terms of \emph{average-case} error, than the widely used bounding strategy of Amin et al. (2019).
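
To make the setting concrete, the following is a minimal Python sketch of a generic bounding strategy followed by the Laplace mechanism. It is not the optimal strategy characterized in this work, nor the exact procedure of Amin et al. (2019). The function name `clipped_user_level_dp_mean` and the parameters `upper` and `cap` are hypothetical. The sketch assumes that samples lie in $[0, U]$ with $U$ given by `upper`, that the number of samples per user is public, and that neighboring datasets differ in the sample values of a single user.

```python
import numpy as np


def clipped_user_level_dp_mean(user_samples, upper, cap, epsilon, rng=None):
    """Illustrative user-level eps-DP mean via per-user caps (a sketch only).

    Assumptions (not taken from the paper): samples lie in [0, upper],
    per-user sample counts are public, and neighboring datasets differ
    in the sample values of exactly one user.
    """
    if cap < 1 or epsilon <= 0 or upper <= 0:
        raise ValueError("cap must be >= 1; epsilon and upper must be > 0")
    rng = np.random.default_rng() if rng is None else rng

    # Bounding step: keep at most `cap` samples per user and force
    # every retained value into [0, upper].
    kept = [
        np.clip(np.asarray(s, dtype=float)[:cap], 0.0, upper)
        for s in user_samples
    ]
    counts = np.array([len(k) for k in kept])
    total = counts.sum()
    if total == 0:
        raise ValueError("no samples retained")

    # Mean of the retained samples; dropping samples introduces bias.
    clipped_mean = sum(k.sum() for k in kept) / total

    # Changing one user's values moves the retained sum by at most
    # upper * (largest retained count), hence this user-level sensitivity.
    sensitivity = upper * counts.max() / total

    # Laplace mechanism with scale sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped_mean + noise, clipped_mean, sensitivity


# Heterogeneous toy data: one heavy user and two light users.
data = [[1.0, 1.0, 1.0, 1.0], [0.0], [0.0]]
estimate, clipped_mean, sensitivity = clipped_user_level_dp_mean(
    data, upper=1.0, cap=2, epsilon=1.0, rng=np.random.default_rng(0)
)
```

In this toy dataset the true sample mean is 4/6, while capping each user at two samples yields a clipped mean of 0.5 with sensitivity 0.5. The gap between the two means is the bias introduced by bounding, and the sensitivity controls the noise scale. The choice of `cap` trades one against the other, and the worst-case error studied in this work accounts for both effects.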
