The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve low prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, in which the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
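The first main result concerns the minimum eigenvalue of the empirical NT kernel, and it can be illustrated numerically. The following is a minimal sketch and not the paper's construction. It assumes a two-layer network $f(x) = N^{-1/2}\sum_{i\le N} a_i\,\sigma(\langle w_i, x\rangle)$ with $a_i = \pm 1$, linearized with respect to the first-layer weights only. It also assumes a tanh activation, covariates drawn uniformly from the sphere of radius $\sqrt{d}$, weights drawn uniformly from the unit sphere, and a $1/(Nd)$ normalization of the kernel. All function names are hypothetical.

```python
import numpy as np


def sample_sphere(rng, m, d, radius):
    """Draw m points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((m, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)


def empirical_ntk(X, W, dsigma):
    """Empirical NT kernel matrix (assumed normalization 1/(N d)).

    K[j, k] = (1/(N d)) * sum_i sigma'(<w_i, x_j>) sigma'(<w_i, x_k>) <x_j, x_k>,
    which is the Gram matrix of the gradients of f with respect to the
    first-layer weights when the second-layer weights satisfy a_i^2 = 1.
    """
    N, d = W.shape
    D = dsigma(X @ W.T)                  # (n, N): sigma'(<w_i, x_j>)
    return (D @ D.T) * (X @ X.T) / (N * d)


def ntk_min_eigenvalue(n, d, N, seed=0):
    """Smallest eigenvalue of the empirical NT kernel on n random covariates."""
    rng = np.random.default_rng(seed)
    X = sample_sphere(rng, n, d, np.sqrt(d))   # assumed data model
    W = sample_sphere(rng, N, d, 1.0)          # assumed weight initialization
    dtanh = lambda t: 1.0 - np.tanh(t) ** 2    # derivative of the assumed activation
    K = empirical_ntk(X, W, dtanh)
    return float(np.linalg.eigvalsh(K)[0])     # eigvalsh returns ascending order


if __name__ == "__main__":
    n, d = 100, 20
    for N in (2, 5, 20, 100):
        lam = ntk_min_eigenvalue(n, d, N)
        print(f"N*d = {N * d:5d}, n = {n}: lambda_min = {lam:.3e}")
```

The kernel matrix is the Gram matrix of $n$ feature vectors of dimension $Nd$, so its rank is at most $Nd$. It is therefore singular whenever $Nd < n$, and exact interpolation of arbitrary labels is only possible once $Nd$ exceeds $n$. The sketch makes this transition visible at small scale, but it does not reproduce the paper's quantitative bounds.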
