Unbounded Differentially Private Quantile and Maximum Estimation (Unbounded Differentially Private Quantile and Maximum Estimation)

In this work we consider the problem of differentially private computation of quantiles for the data, especially the highest quantiles such as maximum, but with an unbounded range for the dataset. We show that this can be done efficiently through a simple invocation of $\texttt{AboveThreshold}$, a subroutine that is iteratively called in the fundamental Sparse Vector Technique, even when there is no upper bound on the data. In particular, we show that this procedure can give more accurate and robust estimates on the highest quantiles with applications towards clipping that is essential for differentially private sum and mean estimation. In addition, we show how two invocations can handle the fully unbounded data setting. Within our study, we show that an improved analysis of $\texttt{AboveThreshold}$ can improve the privacy guarantees for the widely used Sparse Vector Technique that is of independent interest. We give a more general characterization of privacy loss for $\texttt{AboveThreshold}$ which we immediately apply to our method for improved privacy guarantees. Our algorithm only requires one $O(n)$ pass through the data, which can be unsorted, and each subsequent query takes $O(1)$ time. We empirically compare our unbounded algorithm with the state-of-the-art algorithms in the bounded setting. For inner quantiles, we find that our method often performs better on non-synthetic datasets. For the maximal quantiles, which we apply to differentially private sum computation, we find that our method performs significantly better.

翻译：在这项工作中，我们考虑了数据的差分私有计算问题，特别是对于最高分位数（例如最大值），但数据集的范围是无界的。我们证明了即使数据没有上界，也可以通过对基本的稀疏向量技巧中的$\texttt{AboveThreshold}$子例程进行迭代调用来有效地实现。特别地，我们展示了这个过程可以更准确、更稳定地估计最高分位数，它在差分私有求和和均值估计中是必不可少的。此外，我们展示了如何使用两个调用来处理完全无界的数据设置。在我们的研究中，我们展示了对于广泛使用的稀疏向量技术，$\texttt{AboveThreshold}$的改进分析可以提高隐私保证，这是独立的利益。我们给出了$\texttt{AboveThreshold}$的隐私损失的更一般的表征，我们立即应用到我们的方法中，提供了更好的隐私保证。我们的算法只需要通过数据进行一次$O(n)$的遍历，可以是未排序的，而且每个后续查询只需要$O(1)$的时间。我们在有界的设置下将未绑定的算法与现有技术的算法进行了实证比较。对于内部分位数，我们发现我们的方法在非合成数据集上通常表现更好。对于最大分位数，我们将其应用于差分私有求和计算，我们发现我们的方法表现得更好。