This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis under computer memory constraints. Specifically, from the whole dataset of size $N$, $m_N$ subsamples are randomly drawn, each of size $k_N\ll N$ to meet the memory constraint and sampled uniformly without replacement. Aggregating the estimators computed on the $m_N$ subsamples yields the subbagging estimator. To analyze the theoretical properties of the subbagging estimator, we adapt incomplete $U$-statistics theory with an infinite-order kernel to allow overlap among the drawn subsamples in the sampling procedure. Using this novel theoretical framework, we demonstrate that, with a proper selection of the hyperparameters $k_N$ and $m_N$, the subbagging estimator achieves $\sqrt{N}$-consistency and asymptotic normality under the condition $(k_Nm_N)/N\to \alpha \in (0,\infty]$. Compared to the full sample estimator, we show theoretically that the $\sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/\alpha$ in its asymptotic variance. Simulation experiments are presented to demonstrate the finite sample performance. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate and is computationally fast under the memory constraint.
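The sampling-and-aggregation procedure described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: `estimator` stands in for any estimator computable on a single subsample (here the sample mean, a hypothetical choice for concreteness), and the names `subbagging_estimate`, `k`, and `m` are illustrative, corresponding to $k_N$ and $m_N$.

```python
import numpy as np

def subbagging_estimate(data, k, m, estimator=np.mean, rng=None):
    """Subbagging: draw m subsamples of size k (each drawn uniformly
    without replacement), apply the estimator to each subsample,
    and average the m resulting estimates."""
    rng = np.random.default_rng(rng)
    N = len(data)
    estimates = []
    for _ in range(m):
        # One subsample of size k << N, fitting the memory constraint;
        # subsamples may overlap with one another across the m draws.
        idx = rng.choice(N, size=k, replace=False)
        estimates.append(estimator(data[idx]))
    return np.mean(estimates)
```

Only one subsample of size $k_N$ needs to reside in memory at a time, which is the point of the method; with $(k_N m_N)/N \to \alpha$, the aggregated estimate tracks the full sample estimate at the cost of the variance inflation noted above.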