Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) become unreliable when $p=O(N)$, due to overfitting. This limitation has become a serious bottleneck for many disciplines with increasingly high-dimensional data. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and how, with the replica method from statistical physics, one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLMs), with possibly correlated covariates. We first analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of $L_2$ priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the $L_2$ regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by applications to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime $p=O(N)$.
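For concreteness, the ML/MAP setting referred to above can be sketched as follows; the notation here is illustrative and not necessarily that of the main text, with $t_i$ the outcome of sample $i$, $\mathbf{z}_i\in\mathbb{R}^p$ its covariate vector, and $\eta\geq 0$ the $L_2$ regularization strength:
\[
\hat{\boldsymbol{\beta}} \;=\; \operatorname*{argmax}_{\boldsymbol{\beta}\in\mathbb{R}^p}\Big[\sum_{i=1}^{N}\log p\big(t_i\,\big|\,\mathbf{z}_i\!\cdot\!\boldsymbol{\beta}\big)\;-\;\tfrac{1}{2}\eta\,\boldsymbol{\beta}\!\cdot\!\boldsymbol{\beta}\Big],
\]
in which the GLM assumption is that outcomes depend on the covariates only via the linear predictor $\mathbf{z}_i\cdot\boldsymbol{\beta}$, and $\eta=0$ recovers ML. The overfitting bias discussed above concerns the relation between $\hat{\boldsymbol{\beta}}$ and the true coefficients when the ratio $p/N$ remains finite as $N$ grows.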