Approximate Co-Sufficient Sampling for Goodness-of-fit Tests and Synthetic Data

Co-sufficient sampling refers to resampling the data conditional on a sufficient statistic. It is a useful technique for statistical problems such as goodness-of-fit tests, model selection, and confidence interval construction, and it is also a powerful tool for generating synthetic data that limits the disclosure risk of sensitive data. However, sampling from such conditional distributions is both technically and computationally challenging, and is inapplicable in models without low-dimensional sufficient statistics. We study an indirect inference approach to approximate co-sufficient sampling, which only requires an efficient statistic rather than a sufficient statistic. Given an efficient estimator, we prove that the expected KL divergence between the true conditional distribution and the resulting approximate distribution goes to zero. We also propose a one-step approximate solution to the optimization problem that preserves the original estimator with an error of $o_p(n^{-1/2})$, which suffices for asymptotic optimality. The one-step method is easily implemented, highly computationally efficient, and applicable to a wide variety of models, only requiring the ability to sample from the model and compute an efficient statistic. We apply our methods in simulations to problems in synthetic data, hypothesis testing, and differential privacy.
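
The abstract does not spell out the one-step algorithm, so the following is only a minimal sketch of one way such a correction could look, under stated assumptions. It assumes a toy exponential model with scale parameter `theta`, the sample mean as the efficient estimator, and a hypothetical update rule `theta_new = 2 * theta_x - theta_z` applied with the random seed held fixed. The function names (`sample_model`, `estimator`, `one_step_synthetic`) are illustrative and not taken from the paper. The sketch uses only the two ingredients the abstract names: the ability to sample from the model and to compute an efficient statistic.

```python
import numpy as np


def sample_model(theta, u):
    """Draw Exponential(scale=theta) values by inverse CDF from fixed uniforms u."""
    return -theta * np.log1p(-u)


def estimator(x):
    """Efficient estimator of the exponential scale: the sample mean (the MLE)."""
    return float(np.mean(x))


def one_step_synthetic(x, seed=0):
    """Return (y, z): a one-step corrected synthetic sample and a naive one.

    Assumed update rule (not confirmed by the abstract):
      1. estimate theta_x from the real data,
      2. draw a pilot sample z from the fitted model with a fixed seed,
      3. shift the parameter by the pilot sample's estimation error,
      4. redraw with the same seed at the shifted parameter.
    """
    theta_x = estimator(x)
    u = np.random.default_rng(seed).random(len(x))  # fixed randomness, reused below
    z = sample_model(theta_x, u)          # naive parametric-bootstrap sample
    theta_z = estimator(z)
    theta_new = 2.0 * theta_x - theta_z   # one-step correction (assumption)
    y = sample_model(theta_new, u)        # same uniforms, corrected parameter
    return y, z


# Simulated "sensitive" data, for illustration only.
x = np.random.default_rng(123).exponential(scale=3.0, size=10_000)
y, z = one_step_synthetic(x, seed=0)

err_naive = abs(estimator(z) - estimator(x))
err_one_step = abs(estimator(y) - estimator(x))
print(f"naive error:    {err_naive:.6f}")
print(f"one-step error: {err_one_step:.6f}")
```

In this toy model the naive sample misses the original estimate by a relative error $\delta = \bar{E} - 1$, where $\bar{E}$ is the mean of the standard exponential draws, while the corrected sample misses it by $\delta^2$. Since $\delta = O_p(n^{-1/2})$, the corrected error is $O_p(n^{-1})$, which is consistent with the $o_p(n^{-1/2})$ rate stated in the abstract. Other models and estimators would need their own check.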
