A key tool for building differentially private systems is adding Gaussian noise to the output of a function evaluated on a sensitive dataset. Unfortunately, using a continuous distribution presents several practical challenges. First and foremost, finite computers cannot exactly represent samples from continuous distributions, and previous work has demonstrated that seemingly innocuous numerical errors can entirely destroy privacy. Moreover, when the underlying data is itself discrete (e.g., population counts), adding continuous noise makes the result less interpretable. With these shortcomings in mind, we introduce and analyze the discrete Gaussian in the context of differential privacy. Specifically, we theoretically and experimentally show that adding discrete Gaussian noise provides essentially the same privacy and accuracy guarantees as the addition of continuous Gaussian noise. We also present a simple and efficient algorithm for exact sampling from this distribution, demonstrating its applicability for privately answering counting queries, or, more generally, low-sensitivity integer-valued queries.
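The paper's sampler works by rejection from a discrete Laplace proposal. The sketch below, with helper names of our own choosing, mirrors that structure in Python; note that it uses floating-point `exp` comparisons for readability, whereas the point of the paper's algorithm is that exact sampling needs no such approximations.

```python
import math
import random

def sample_discrete_laplace(scale):
    """Sample Y on the integers with P(Y = y) proportional to exp(-|y|/scale)."""
    p = 1.0 - math.exp(-1.0 / scale)
    while True:
        magnitude = 0
        while random.random() > p:          # geometric magnitude
            magnitude += 1
        sign = 1 if random.random() < 0.5 else -1
        if not (sign == -1 and magnitude == 0):   # avoid double-counting zero
            return sign * magnitude

def sample_discrete_gaussian(sigma):
    """Rejection-sample Z with P(Z = z) proportional to exp(-z^2 / (2 sigma^2))."""
    t = math.floor(sigma) + 1
    while True:
        y = sample_discrete_laplace(t)
        # Gaussian-to-Laplace density ratio, normalized to lie in (0, 1]
        accept = math.exp(-((abs(y) - sigma * sigma / t) ** 2) / (2.0 * sigma * sigma))
        if random.random() < accept:
            return y
```

The acceptance probability comes from completing the square in the ratio of the two unnormalized densities, so the accepted samples follow the discrete Gaussian exactly up to the floating-point caveat above.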

### Related Content


Differential privacy is a strong mathematical notion of privacy. Still, a prominent challenge when using differential privacy in real data collection is understanding and counteracting the accuracy loss that differential privacy imposes. As such, the accuracy-privacy trade-off of differential privacy needs to be balanced on a case-by-case basis. Applications in the literature tend to focus solely on analytical accuracy bounds, fail to include data in error prediction, or use arbitrary settings to measure error empirically. To fill this gap in the literature, we propose a novel application of factor experiments to create data-aware error predictions. Basically, factor experiments provide a systematic approach to conducting empirical experiments. To demonstrate our methodology in action, we conduct a case study where error is dependent on arbitrarily complex tree structures. We first construct a tool to simulate poll data. Next, we use our simulated data to construct a least squares model to predict error. Last, we show how to validate the model. Consequently, our contribution is a method for constructing error-prediction models that are data-aware.
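As an illustration of the factor-experiment idea, the following sketch runs a 2^2 full factorial over two hypothetical factors (a coded privacy level and a coded tree depth, both invented here) against a simulated error response, and estimates main effects by the usual high-minus-low contrast:

```python
import random
random.seed(1)

def simulated_error(eps_high, depth_high):
    # hypothetical ground truth: error falls with higher epsilon, grows with depth
    a = 1 if eps_high else -1
    b = 1 if depth_high else -1
    return 10.0 - 3.0 * a + 2.0 * b + 0.5 * a * b + random.gauss(0.0, 0.1)

# 2^2 full factorial with 5 replicates per cell
runs = []
for eps_high in (False, True):
    for depth_high in (False, True):
        for _ in range(5):
            runs.append((eps_high, depth_high, simulated_error(eps_high, depth_high)))

def main_effect(runs, idx):
    """High-minus-low contrast: twice the corresponding regression coefficient."""
    hi = [r[2] for r in runs if r[idx]]
    lo = [r[2] for r in runs if not r[idx]]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

eps_effect = main_effect(runs, 0)     # should recover about -6
depth_effect = main_effect(runs, 1)   # should recover about +4
```

With coded ±1 levels, the same contrasts are exactly what a least squares fit of the factorial model would return, which is why factor experiments and regression-based error prediction fit together.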

In this paper, we consider possibly misspecified stochastic differential equation models driven by L\'{e}vy processes. Regardless of whether the driving noise is Gaussian or not, the Gaussian quasi-likelihood estimator can estimate the unknown parameters in the drift and scale coefficients. However, in the misspecified case, the asymptotic distribution of the estimator varies with the correction of the misspecification bias, and consistent estimators of the asymptotic variance proposed for the correctly specified case may lose their theoretical validity. As one solution, we propose a bootstrap method for approximating the asymptotic distribution. We show that our bootstrap method is theoretically valid in both the correctly specified and the misspecified case, without assuming the precise distribution of the driving noise.
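The paper's bootstrap is tailored to Lévy-driven SDE models; as a generic reminder of the principle it builds on, here is a minimal resampling sketch that approximates the sampling distribution (and hence the asymptotic standard error) of a simple estimator:

```python
import random
import statistics
random.seed(0)

# synthetic data standing in for model residuals/observations
data = [random.gauss(2.0, 1.0) for _ in range(200)]

def bootstrap_distribution(data, estimator, B=1000):
    """Approximate the sampling distribution of `estimator` by
    resampling the data with replacement B times."""
    n = len(data)
    return [estimator([random.choice(data) for _ in range(n)]) for _ in range(B)]

boot_means = bootstrap_distribution(data, statistics.fmean)
boot_se = statistics.stdev(boot_means)   # should be near sigma/sqrt(n) = 1/sqrt(200)
```

The appeal, as in the paper, is that the resampling step makes no assumption about the precise distribution generating the data.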

Centered Gaussian random fields (GRFs) indexed by compacta such as smooth, bounded Euclidean domains or smooth, compact and orientable manifolds are determined by their covariance operators. We consider centered GRFs given as variational solutions to coloring operator equations driven by spatial white noise, with an elliptic self-adjoint pseudodifferential coloring operator from the H\"ormander class. This includes the Mat\'ern class of GRFs as a special case. Using biorthogonal multiresolution analyses on the manifold, we prove that the precision and covariance operators, respectively, may be identified with bi-infinite matrices and finite sections may be diagonally preconditioned rendering the condition number independent of the dimension $p$ of this section. We prove that a tapering strategy by thresholding applied on finite sections of the bi-infinite precision and covariance matrices results in optimally numerically sparse approximations. That is, asymptotically only linearly many nonzero matrix entries are sufficient to approximate the original section of the bi-infinite covariance or precision matrix using this tapering strategy to arbitrary precision. The locations of these nonzero matrix entries are known a priori. The tapered covariance or precision matrices may also be optimally diagonally preconditioned. Analysis of the relative size of the entries of the tapered covariance matrices motivates novel, multilevel Monte Carlo (MLMC) oracles for covariance estimation, in sample complexity that scales log-linearly with respect to the number $p$ of parameters. In addition, we propose and analyze a novel compressive algorithm for simulating and kriging of GRFs. The complexity (work and memory vs. accuracy) of these three algorithms scales near-optimally in terms of the number of parameters $p$ of the sample-wise approximation of the GRF in Sobolev scales.
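A toy illustration of the tapering claim: for an exponentially decaying (Matérn-type) covariance on a one-dimensional grid, thresholding leaves only a band of entries whose count grows linearly in the grid size. The grid size, decay rate, and threshold below are arbitrary choices of ours:

```python
import math

n = 200                 # grid points
rho = 5.0               # correlation length
# exponential (Matérn nu = 1/2 type) covariance: entries decay with |i - j|
C = [[math.exp(-abs(i - j) / rho) for j in range(n)] for i in range(n)]

# taper by thresholding: drop every entry at or below tau
tau = 1e-3
kept = sum(1 for i in range(n) for j in range(n) if C[i][j] > tau)

# only a band of halfwidth about rho * ln(1/tau) survives,
# so kept = O(n) rather than O(n^2)
band = int(rho * math.log(1.0 / tau))
```

The locations of the surviving entries (a band around the diagonal) are known a priori, which is the structural fact the paper's preconditioning and compression arguments exploit in far greater generality.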

We consider stochastic optimization problems where a smooth (and potentially nonconvex) objective is to be minimized using a stochastic first-order oracle. These types of problems arise in many settings, from simulation optimization to deep learning. We present Retrospective Approximation (RA) as a universal sequential sample-average approximation (SAA) paradigm where during each iteration $k$, a sample-path approximation problem is implicitly generated using an adapted sample size $M_k$, and solved (with prior solutions as "warm start") to an adapted error tolerance $\epsilon_k$, using a "deterministic method" such as the line search quasi-Newton method. The principal advantage of RA is that it decouples optimization from stochastic approximation, allowing the direct adoption of existing deterministic algorithms without modification, thus mitigating the need to redesign algorithms for the stochastic context. A second advantage is the obvious manner in which RA lends itself to parallelization. We identify conditions on $\{M_k, k \geq 1\}$ and $\{\epsilon_k, k\geq 1\}$ that ensure almost sure convergence and convergence in $L_1$-norm, along with optimal iteration and work complexity rates. We illustrate the performance of RA with line-search quasi-Newton on an ill-conditioned least squares problem, as well as an image classification problem using a deep convolutional neural net.
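A minimal sketch of the RA loop, under assumptions of our own (a one-dimensional quadratic objective, plain gradient descent as the "deterministic method", doubling sample sizes, and halving tolerances):

```python
import random
random.seed(0)

def ra_minimize(x0, outer_iters=6):
    """Retrospective Approximation sketch for min_x E[(x - xi)^2], xi ~ N(1, 1)."""
    x = x0
    M, eps = 8, 0.1
    for k in range(outer_iters):
        sample = [random.gauss(1.0, 1.0) for _ in range(M)]
        # deterministic inner solve of the SAA problem min_x mean((x - xi)^2),
        # warm-started at the previous x, to gradient tolerance eps
        while True:
            g = sum(2.0 * (x - xi) for xi in sample) / len(sample)
            if abs(g) <= eps:
                break
            x -= 0.25 * g
        M *= 2          # grow the adapted sample size M_k
        eps *= 0.5      # tighten the adapted tolerance eps_k
    return x

x_star = ra_minimize(0.0)   # true minimizer is E[xi] = 1
```

Any off-the-shelf deterministic solver could replace the inner gradient loop unchanged, which is the decoupling the abstract emphasizes.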

Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters $(\epsilon,\delta)$. Choosing meaningful privacy parameters is key since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the upper bound on privacy loss $(\epsilon,\delta)$ might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which $(\epsilon,\delta)$ are only indirectly related. We transform $(\epsilon,\delta)$ to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical $(\epsilon,\delta)$.
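For intuition on the transformation, the classic pure-ε case already yields a closed-form posterior bound via Bayes' rule applied to the $e^\epsilon$ likelihood-ratio bound; the paper's contribution extends such bounds to $(\epsilon,\delta)$ under composition. A sketch:

```python
import math

def posterior_belief_bound(epsilon, prior=0.5):
    """Upper bound on the adversary's posterior belief that a record is present,
    for a pure epsilon-DP mechanism: Bayes rule with likelihood ratio <= e^eps."""
    lr = math.exp(epsilon)
    return (lr * prior) / (lr * prior + (1.0 - prior))

# epsilon = 1 with a uniform prior caps the belief at e / (1 + e), about 0.731
```

This is the sense in which $(\epsilon,\delta)$ relates, if only indirectly, to individual identifiability: the bound says how far an adversary's belief can move away from the prior.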

The direct Gaussian copula model with discrete marginal distributions is an appealing data-analytic tool but poses difficult computational challenges due to its intractable likelihood. A number of approximations/surrogates for the likelihood have been proposed, including the continuous extension-based approximation (CE) and the distributional transform-based approximation (DT). The continuous extension approach is exact up to Monte Carlo error but does not scale well computationally. The distributional transform approach permits efficient computation but offers no theoretical guarantee that it is exact. In practice, though, the distributional transform-based approximate likelihood is so nearly exact for some variants of the model as to permit genuine maximum likelihood or Bayesian inference. We demonstrate the exactness of the distributional transform-based objective function for two interesting variants of the model, and propose a quantity that can be used to assess exactness for experimentally observed datasets. This diagnostic will permit practitioners to determine whether genuine Bayesian inference or ordinary maximum likelihood inference using the DT-based likelihood is possible for a given dataset.
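For reference, the distributional transform itself is a one-liner: it smooths a discrete marginal with an independent uniform so that, at the true distribution, the transformed value is exactly Uniform(0, 1). A sketch with a Bernoulli marginal (our example, not the paper's):

```python
import random
random.seed(0)

def distributional_transform(x, cdf, pmf):
    """DT of a discrete X: U = F(x-) + V * P(X = x), V ~ Uniform(0, 1).
    If X ~ F, then U is exactly Uniform(0, 1)."""
    v = random.random()
    return (cdf(x) - pmf(x)) + v * pmf(x)

# empirical check on a Bernoulli(0.3) marginal
cdf = lambda x: 0.7 if x == 0 else 1.0
pmf = lambda x: 0.7 if x == 0 else 0.3
draws = [distributional_transform(0 if random.random() < 0.7 else 1, cdf, pmf)
         for _ in range(20000)]
```

The DT-based surrogate likelihood plugs these uniformized values into the Gaussian copula density, which is what makes the computation efficient for discrete marginals.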

Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based $\gamma$-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable theoretical robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear-time computational complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures.

Mendelian randomization (MR) is a statistical method exploiting genetic variants as instrumental variables to estimate the causal effect of modifiable risk factors on an outcome of interest. Despite wide use of various popular two-sample MR methods based on genome-wide association study summary level data, those methods can suffer from potential power loss and/or biased inference when the chosen genetic variants are in linkage disequilibrium (LD) and have relatively large direct effects on the outcome whose distribution might be heavy-tailed, which is commonly referred to as idiosyncratic pleiotropy. To resolve those two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized $t$-distribution to model such direct effects in a probabilistic model framework which can also incorporate the LD structure explicitly. The generalized $t$-distribution can be represented as a Gaussian scale mixture so that our model parameters can be estimated by EM-type algorithms. We compute the standard errors by calibrating the evidence lower bound (ELBO) using the likelihood ratio test. Through extensive simulation studies, we show that our RBMR has robust performance compared to other competing methods. We also apply our RBMR method to two benchmark data sets and find that RBMR has smaller bias and standard errors. Using our proposed RBMR method, we found that coronary artery disease (CAD) is associated with increased risk of coronavirus disease 2019 (COVID-19). We also develop a user-friendly R package RBMR for public use.
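The Gaussian scale-mixture representation that the estimation relies on is easy to state in code: a $t$ variate is a standard normal divided by the square root of an independent chi-squared over its degrees of freedom. A sketch (plain univariate Student-$t$ rather than the paper's multivariate generalized $t$):

```python
import math
import random
random.seed(0)

def t_sample(nu, mu=0.0, scale=1.0):
    """Student-t via its Gaussian scale mixture:
    T = mu + scale * Z / sqrt(V / nu), Z ~ N(0,1), V ~ chi-squared(nu).
    Conditioning on V makes T Gaussian, which is what EM-type algorithms exploit."""
    z = random.gauss(0.0, 1.0)
    v = random.gammavariate(nu / 2.0, 2.0)   # chi-squared(nu) as Gamma(nu/2, scale 2)
    return mu + scale * z / math.sqrt(v / nu)

# for nu = 10 the variance should be nu / (nu - 2) = 1.25
draws = [t_sample(10.0) for _ in range(50000)]
```

The heavy tails come from small draws of the mixing variable, which is exactly what absorbs idiosyncratic pleiotropy in the model.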

Training machine learning models on sensitive user data raises increasing privacy concerns in many areas. Federated learning is a popular approach for privacy protection that collects local gradient information instead of real data. One way to achieve a strict privacy guarantee is to apply local differential privacy to federated learning. However, previous works do not give a practical solution, due to three issues. First, the noisy data is close to its original value with high probability, increasing the risk of information exposure. Second, a large variance is introduced to the estimated average, causing poor accuracy. Last, the privacy budget explodes due to the high dimensionality of weights in deep learning models. In this paper, we propose a novel local differential privacy mechanism for federated learning that addresses the above issues. It is capable of making the data more distinct from its original value and introducing lower variance. Moreover, the proposed mechanism bypasses the curse of dimensionality by splitting and shuffling model updates. A series of empirical evaluations on three commonly used datasets, MNIST, Fashion-MNIST and CIFAR-10, demonstrate that our solution can not only achieve superior deep learning performance but also provide a strong privacy guarantee at the same time.
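A toy sketch of the split-and-shuffle idea, with every mechanism choice ours (a Duchi-style one-bit perturbation, an in-process shuffler, three dimensions): each client sends one perturbed share per coordinate, and shuffling removes the link between shares and clients while per-dimension averages stay unbiased:

```python
import math
import random
random.seed(0)

def perturb(value, epsilon):
    """One-bit LDP mechanism for a value in [-1, 1]: report +c or -c,
    with probabilities chosen so the report is unbiased for `value`."""
    c = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    p = 0.5 + value * (math.exp(epsilon) - 1.0) / (2.0 * (math.exp(epsilon) + 1.0))
    return c if random.random() < p else -c

def client_report(update, epsilon):
    # "splitting": one (dimension, perturbed value) share per coordinate,
    # sent as separate reports instead of one linkable vector
    return [(j, perturb(x, epsilon)) for j, x in enumerate(update)]

# simulate clients whose true per-dimension means are below
true_means, eps, n_clients = [0.5, -0.2, 0.0], 1.0, 4000
reports = []
for _ in range(n_clients):
    update = [max(-1.0, min(1.0, random.gauss(m, 0.1))) for m in true_means]
    reports.extend(client_report(update, eps))
random.shuffle(reports)   # "shuffling": the server cannot link shares to clients

# server-side unbiased per-dimension mean estimates
sums, counts = [0.0] * 3, [0] * 3
for j, r in reports:
    sums[j] += r
    counts[j] += 1
estimates = [s / c for s, c in zip(sums, counts)]
```

Because each share spends its own budget on a single dimension, the per-report budget no longer scales with the model dimensionality, which is the point of the split.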

We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can be trained by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
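The continuous-depth view can be made concrete with a fixed-step Euler solver standing in for the black-box ODE solver; the one-unit "network" here is purely illustrative:

```python
import math

def f(h, w, b):
    """Tiny stand-in for the neural network that parameterizes dh/dt."""
    return math.tanh(w * h + b)

def odeint_euler(h0, w, b, t0=0.0, t1=1.0, steps=1000):
    """Fixed-step Euler solve of dh/dt = f(h, theta). Each step is the
    continuous-depth analogue of a residual block h_{k+1} = h_k + dt * f(h_k)."""
    h, dt = h0, (t1 - t0) / steps
    for _ in range(steps):
        h += dt * f(h, w, b)
    return h

# with w = 0 the derivative is constant, so h(1) = h0 + tanh(b) exactly
out = odeint_euler(0.0, 0.0, 0.5)
```

In practice an adaptive solver replaces the fixed-step loop, which is what lets these models trade numerical precision for speed per input.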

Helmut Harbrecht, Lukas Herrmann, Kristin Kirchner, Christoph Schwab
David Newton, Raghu Bollapragada, Raghu Pasupathy, Nung Kwan Yip
Daniel Bernau, Günther Eibl, Philip W. Grassal, Hannah Keller, Florian Kerschbaum
Masahiro Fujisawa, Takeshi Teshima, Issei Sato, Masashi Sugiyama
Lichao Sun, Jianwei Qian, Xun Chen, Philip S. Yu
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud
