This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis under computer memory constraints. Specifically, from the whole dataset of size $N$, $m_N$ subsamples are randomly drawn, each of size $k_N\ll N$ to meet the memory constraint and sampled uniformly without replacement. Aggregating the estimators computed on the $m_N$ subsamples yields the subbagging estimator. To analyze the theoretical properties of the subbagging estimator, we adapt incomplete $U$-statistics theory with an infinite-order kernel to allow overlap among the drawn subsamples in the sampling procedure. Using this novel theoretical framework, we demonstrate that, with a proper selection of the hyperparameters $k_N$ and $m_N$, the subbagging estimator achieves $\sqrt{N}$-consistency and asymptotic normality under the condition $(k_Nm_N)/N\to \alpha \in (0,\infty]$. Compared to the full sample estimator, we show theoretically that the $\sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/\alpha$ in its asymptotic variance. Simulation experiments are presented to demonstrate the finite sample performance. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate and is computationally fast under the memory constraint.
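The sampling-and-aggregation procedure described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: `estimator` stands in for any estimator computable on a single subsample (here the sample mean, a hypothetical choice for concreteness), and the names `subbagging_estimate`, `k`, and `m` are illustrative, corresponding to $k_N$ and $m_N$.

```python
import numpy as np

def subbagging_estimate(data, k, m, estimator=np.mean, rng=None):
    """Subbagging: draw m subsamples of size k (each drawn uniformly
    without replacement), apply the estimator to each subsample,
    and average the m resulting estimates."""
    rng = np.random.default_rng(rng)
    N = len(data)
    estimates = []
    for _ in range(m):
        # One subsample of size k << N, fitting the memory constraint;
        # subsamples may overlap with one another across the m draws.
        idx = rng.choice(N, size=k, replace=False)
        estimates.append(estimator(data[idx]))
    return np.mean(estimates)
```

Only one subsample of size $k_N$ needs to reside in memory at a time, which is the point of the method; with $(k_N m_N)/N \to \alpha$, the aggregated estimate tracks the full sample estimate at the cost of the variance inflation noted above.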