Nonuniform subsampling methods are effective at reducing the computational burden while maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement because of its high computational efficiency. However, if the data volume is so large that the nonuniform subsampling probabilities cannot be calculated all at once, subsampling with replacement is infeasible. This paper addresses this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To handle the case where the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework in which statistics are computed simultaneously on smaller partitions of the full data. We investigate the asymptotic properties of the resultant aggregated estimator. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real data sets.
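To make the distinction concrete, the following is a minimal sketch of generic Poisson subsampling, not the optimal probabilities derived in the paper. Each data point is kept or dropped by an independent Bernoulli draw, so the data can be scanned one chunk at a time. The function name `poisson_subsample`, the importance `scores`, and the assumption that their total `score_total` is available up front (for example from a pilot estimate) are all hypothetical choices made for illustration.

```python
import numpy as np

def poisson_subsample(scores, expected_size, score_total, rng, chunk_size=10_000):
    """Select points independently, one chunk at a time.

    scores:      nonnegative importance scores (hypothetical placeholder).
    score_total: sum, or a pilot estimate of the sum, of all scores (assumed known).
    Returns the selected indices and their inverse-probability weights.
    """
    kept_idx, kept_w = [], []
    for start in range(0, len(scores), chunk_size):
        chunk = np.asarray(scores[start:start + chunk_size], dtype=float)
        # Inclusion probability, capped at 1 so that it is a valid probability.
        p = np.minimum(1.0, expected_size * chunk / score_total)
        # Independent Bernoulli decision per point; other chunks are never touched.
        keep = rng.random(chunk.shape[0]) < p
        kept_idx.append(start + np.flatnonzero(keep))
        # Weight each kept point by 1/p to offset unequal inclusion probabilities.
        kept_w.append(1.0 / p[keep])
    return np.concatenate(kept_idx), np.concatenate(kept_w)

rng = np.random.default_rng(0)
scores = rng.exponential(size=100_000)  # synthetic scores, for illustration only
idx, w = poisson_subsample(scores, expected_size=1_000,
                           score_total=scores.sum(), rng=rng)
```

Unlike subsampling with replacement, the realized subsample size here is random: it is at most `expected_size` in expectation, and the cap at 1 can only lower it. The returned weights would typically enter a weighted estimating equation.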