Universality of high-dimensional scaling limits of stochastic gradient descent

We consider statistical tasks in high dimensions whose loss depends on the data only through its projection onto a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss using one- and two-layer networks, and learning single- and multi-index models with one- and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE) as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal: the limit is the same whenever the data is drawn from mixtures of arbitrary product distributions whose first two moments match those of the corresponding Gaussian distribution, provided the initialization and ground truth vectors are coordinate-delocalized. We complement this by proving two corresponding non-universality results. First, we provide a simple example in which the ODE limits are non-universal when the initialization is coordinate-aligned. Second, we show that the stochastic differential equation limits arising as fluctuations of the summary statistics around the fixed points of their ODEs are not universal.
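
As a rough illustration of the kind of statement involved, the sketch below runs online SGD on a toy linear single-index loss, $\tfrac12(\langle a, x\rangle - \langle a, \theta^*\rangle)^2$, with step size $\delta/d$ and $d \cdot T$ samples, once with Gaussian entries and once with Rademacher entries (a product distribution with the same first two moments). The toy loss, the summary statistic $R = \|x - \theta^*\|^2$, and all names (`run_online_sgd`, `gaussian_entries`, `rademacher_entries`, `ode_solution`) are assumptions made for this example and are not taken from the paper. For this toy model a direct one-step computation gives the formal limit $\dot R = -(2\delta - \delta^2) R$ in rescaled time $t = k/d$, which is what the simulation is compared against.

```python
import numpy as np


def run_online_sgd(sample_entries, d=1000, delta=0.5, horizon=2.0, seed=0):
    """Online SGD for the toy loss 0.5 * (<a, x> - <a, theta_star>)**2.

    Returns rescaled times t = k / d and the summary statistic
    R = |x - theta_star|**2 recorded after each step.
    """
    rng = np.random.default_rng(seed)
    # Delocalized ground truth and initialization (hypothetical choice):
    # every coordinate is of order d**-0.5.
    theta_star = np.ones(d) / np.sqrt(d)
    x = rng.standard_normal(d) / np.sqrt(d)

    n_steps = int(horizon * d)      # sample size proportional to the dimension
    step = delta / d                # step size shrinking like 1 / d
    risk = np.empty(n_steps + 1)
    risk[0] = np.sum((x - theta_star) ** 2)
    for k in range(n_steps):
        a = sample_entries(rng, d)  # one fresh sample per step
        residual = a @ (x - theta_star)
        x = x - step * residual * a  # gradient of the toy loss is residual * a
        risk[k + 1] = np.sum((x - theta_star) ** 2)
    times = np.arange(n_steps + 1) / d
    return times, risk


def gaussian_entries(rng, d):
    # Isotropic Gaussian data.
    return rng.standard_normal(d)


def rademacher_entries(rng, d):
    # Product distribution with mean 0 and variance 1 in every coordinate.
    return rng.choice([-1.0, 1.0], size=d)


def ode_solution(times, r0, delta):
    # Closed-form solution of dR/dt = -(2 * delta - delta**2) * R.
    return r0 * np.exp(-(2.0 * delta - delta**2) * times)


if __name__ == "__main__":
    samplers = [("gaussian", gaussian_entries), ("rademacher", rademacher_entries)]
    for name, sampler in samplers:
        t, r = run_online_sgd(sampler)
        print(name, "simulated:", r[-1], "ODE:", ode_solution(t, r[0], 0.5)[-1])
```

Both runs should track the same exponential curve up to fluctuations that shrink as `d` grows. Note that this linear toy case is too simple to exhibit the non-universality described above: here the fourth moment of the entries enters the one-step drift only at relative order $1/d$, so the limit does not depend on delocalization. Seeing the effect of a coordinate-aligned initialization would require a nonlinear model, where the law of the projection $\langle a, x\rangle$ itself matters.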
