To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there is limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD-trained neural networks. Using precise high-dimensional asymptotics, we characterize the dynamics of the fitted model in two "worlds": in the Oracle World the model is trained on the population distribution, and in the Empirical World the model is trained on a sampled dataset. We show that under mild conditions on the kernel and the $L^2$ target regression function, the training dynamics undergo three stages characterized by the behaviors of the models in the two worlds. Our theoretical results also mathematically formalize some interesting deep learning phenomena. Specifically, in our setting we show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon: during the second stage, the test errors of the two worlds remain close despite the empirical training error being much smaller. Finally, we give a concrete example comparing the dynamics of two different kernels, which shows that faster training is not necessary for better generalization.
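The Empirical World dynamics described above can be illustrated numerically. The following is a minimal sketch, not code from the paper: it simulates gradient flow on a kernel least-squares objective using the closed-form time-$t$ solution $\alpha(t) = K^{-1}(I - e^{-tK})y$, computed via an eigendecomposition of the kernel matrix. The RBF kernel, the sine target function, the noise level, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Illustrative target regression function (not from the paper)
f_star = lambda x: np.sin(2 * np.pi * x[:, 0])

n, n_test = 100, 500
X = rng.uniform(-1, 1, (n, 1))
y = f_star(X) + 0.1 * rng.standard_normal(n)   # noisy sampled dataset
X_test = rng.uniform(-1, 1, (n_test, 1))

K = rbf(X, X)           # empirical kernel matrix
K_test = rbf(X_test, X)
evals, evecs = np.linalg.eigh(K)
evals = np.maximum(evals, 1e-12)  # guard against numerically zero eigenvalues

train_errs, test_errs = [], []
for t in [0.1, 1.0, 10.0, 100.0]:
    # Gradient-flow solution at time t: alpha(t) = K^{-1} (I - e^{-tK}) y.
    # expm1 keeps (1 - e^{-t*lambda}) / lambda stable for small eigenvalues.
    coef = -np.expm1(-t * evals) / evals
    alpha = evecs @ (coef * (evecs.T @ y))
    train_errs.append(np.mean((K @ alpha - y) ** 2))
    test_errs.append(np.mean((K_test @ alpha - f_star(X_test)) ** 2))
    print(f"t={t:7.1f}  train={train_errs[-1]:.4f}  test={test_errs[-1]:.4f}")
```

Printing the two errors along the flow shows the qualitative behavior the abstract describes: the empirical training error decreases monotonically toward zero, while the test error evolves on a different schedule and can stagnate or turn up once the model begins fitting noise.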