Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) become unreliable when $p=O(N)$, due to overfitting. This limitation has become a serious bottleneck for many disciplines with increasingly high-dimensional data. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and how, with the replica method from statistical physics, one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLMs), with possibly correlated covariates. We first analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of $L_2$ priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the $L_2$ regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by applications to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime $p=O(N)$.
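For concreteness, the ML/MAP setting referred to above can be sketched as follows; the notation here is illustrative and not necessarily that of the main text, with $t_i$ the outcome of sample $i$, $\mathbf{z}_i\in\mathbb{R}^p$ its covariate vector, and $\eta\geq 0$ the $L_2$ regularization strength:
\[
\hat{\boldsymbol{\beta}} \;=\; \operatorname*{argmax}_{\boldsymbol{\beta}\in\mathbb{R}^p}\Big[\sum_{i=1}^{N}\log p\big(t_i\,\big|\,\mathbf{z}_i\!\cdot\!\boldsymbol{\beta}\big)\;-\;\tfrac{1}{2}\eta\,\boldsymbol{\beta}\!\cdot\!\boldsymbol{\beta}\Big],
\]
in which the GLM assumption is that outcomes depend on the covariates only via the linear predictor $\mathbf{z}_i\cdot\boldsymbol{\beta}$, and $\eta=0$ recovers ML. The overfitting bias discussed above concerns the relation between $\hat{\boldsymbol{\beta}}$ and the true coefficients when the ratio $p/N$ remains finite as $N$ grows.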