Stochastic Gradient Descent (SGD)-based training of neural networks with a large learning rate or a small batch size typically ends in well-generalizing, flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. However, the curvature along the SGD trajectory is poorly understood. An empirical investigation shows that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch size of SGD. When studying the SGD dynamics in relation to the sharpest directions in this initial phase, we find that the SGD step is large compared to the curvature and commonly fails to minimize the loss along the sharpest directions. Furthermore, using a reduced learning rate along these directions can improve training speed while leading to both sharper and better-generalizing solutions compared to vanilla SGD. In summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that they influence the regions that SGD steers to (where a larger learning rate or a smaller batch size results in wider regions visited), the overall training speed, and the generalization ability of the final model.
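To make the idea of "a reduced learning rate along the sharpest directions" concrete, the following is a minimal sketch (not the authors' implementation) of such an SGD step in PyTorch: the leading Hessian eigenvector is approximated by power iteration on Hessian-vector products, and the gradient component along that direction is scaled down before the update. The helper names (hvp, top_eigvec, sharp_aware_step) and the hyperparameter values are hypothetical.

```python
# Sketch: SGD step with a smaller effective learning rate along the
# sharpest Hessian direction (assumed setup, not the paper's code).
import torch


def hvp(loss, params, vec):
    """Hessian-vector product via double backpropagation."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    dot = torch.dot(flat, vec)
    hv = torch.autograd.grad(dot, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()


def top_eigvec(loss, params, iters=20):
    """Approximate the leading Hessian eigenvector by power iteration."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v /= v.norm()
    for _ in range(iters):
        v = hvp(loss, params, v)
        v /= v.norm() + 1e-12
    return v


def sharp_aware_step(model, loss, lr=0.1, gamma=0.1):
    """One SGD step where the gradient component along the sharpest
    direction is shrunk by gamma < 1 (i.e. a reduced learning rate there)."""
    params = [p for p in model.parameters() if p.requires_grad]
    v = top_eigvec(loss, params)
    grads = torch.autograd.grad(loss, params)
    g = torch.cat([x.reshape(-1) for x in grads]).detach()
    g = g - (1.0 - gamma) * torch.dot(g, v) * v  # shrink sharpest component
    offset = 0
    with torch.no_grad():
        for p in params:
            k = p.numel()
            p -= lr * g[offset:offset + k].view_as(p)
            offset += k
```

Setting gamma = 1 recovers vanilla SGD, while gamma < 1 takes a shorter step along the estimated sharpest direction only; in practice one would recompute the eigenvector only every few iterations, since power iteration over Hessian-vector products is the dominant cost.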