Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
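The following is a minimal sketch of how such a square root scaling rule could be applied in practice when moving to a larger batch size. It assumes the rule multiplies the learning rate by $\sqrt{\kappa}$ when the batch size is multiplied by $\kappa$, and additionally scales $1-\beta_1$ and $1-\beta_2$ by $\kappa$; the latter is an assumption for illustration, and the exact prescription is the one given in the paper. The helper `scale_adam_hyperparams` is hypothetical, not part of any library.

```python
import torch


def scale_adam_hyperparams(lr, beta1, beta2, batch_size, new_batch_size):
    """Hypothetical helper sketching a square root scaling rule.

    Assumption: when the batch size is multiplied by kappa, the learning
    rate is multiplied by sqrt(kappa), and 1 - beta1, 1 - beta2 are
    multiplied by kappa. Refer to the paper for the exact rule.
    """
    kappa = new_batch_size / batch_size
    new_lr = lr * kappa ** 0.5                # square-root scaling of the step size
    new_beta1 = 1.0 - kappa * (1.0 - beta1)   # assumed linear scaling of 1 - beta1
    new_beta2 = 1.0 - kappa * (1.0 - beta2)   # assumed linear scaling of 1 - beta2
    return new_lr, new_beta1, new_beta2


# Example: moving from batch size 256 to 1024 (kappa = 4).
lr, b1, b2 = scale_adam_hyperparams(lr=3e-4, beta1=0.9, beta2=0.999,
                                    batch_size=256, new_batch_size=1024)
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(b1, b2))
```

With $\kappa=4$, the sketch raises the learning rate by a factor of $2$ rather than $4$, which is the defining difference from the linear scaling rule commonly used for SGD.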