Neural network optimization methods fall into two broad classes: adaptive methods such as Adam, and non-adaptive methods such as vanilla stochastic gradient descent (SGD). Here, we formulate neural network optimization as Bayesian filtering. We find that state-of-the-art adaptive (AdamW) and non-adaptive (SGD) methods can be recovered by taking the limits as the amount of information about each parameter becomes large or small, respectively. Building on this view, we develop a new neural network optimization algorithm, AdaBayes, which adaptively transitions between SGD-like and Adam(W)-like behaviour. AdaBayes converges more rapidly than Adam early in learning, and its generalisation performance is competitive with SGD.
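To make the idea concrete, the sketch below maintains a per-parameter posterior variance that doubles as the effective learning rate: with little accumulated gradient information the variance sits near an SGD-like step size, and as information accumulates it shrinks toward an Adam(W)-like step. This is a minimal, hypothetical illustration of the Bayesian-filtering flavour of the method, not the paper's exact AdaBayes update; the function name, the hyperparameters (eta_sgd, eta_adam, beta2, process_noise), and the variance recursion are all assumptions made for exposition.

```python
import numpy as np

def adabayes_like_step(w, grad, sigma2, v, *, eta_sgd=0.1, eta_adam=1e-3,
                       beta2=0.999, process_noise=1e-4):
    """One Bayesian-filtering-flavoured update for a parameter tensor.

    sigma2 is a per-parameter posterior variance that doubles as the
    effective learning rate; v is an Adam-style EMA of squared gradients.
    All names and hyperparameters here are illustrative assumptions.
    """
    # Adam-style second-moment estimate of the gradient "observations".
    v = beta2 * v + (1.0 - beta2) * grad**2

    # Predict: diffuse the posterior back toward the SGD-like prior, so a
    # parameter that has seen little information keeps a large step size.
    sigma2 = np.minimum(sigma2 + process_noise, eta_sgd)

    # Correct: accumulated gradient information shrinks the posterior
    # variance; in the high-information limit sigma2 -> eta_adam / sqrt(v),
    # giving an Adam(W)-like step of roughly eta_adam * grad / sqrt(v).
    sigma2 = 1.0 / (1.0 / sigma2 + np.sqrt(v) / eta_adam)

    # Posterior-mean shift: the variance acts as the learning rate, so the
    # update interpolates between eta_sgd * grad and the Adam-like step.
    w = w - sigma2 * grad
    return w, sigma2, v


# Toy usage: minimise a quadratic with different per-parameter curvatures.
curv = np.array([1.0, 10.0])
w = np.array([5.0, 5.0])
sigma2 = np.full_like(w, 0.1)   # start in the SGD-like regime (eta_sgd)
v = np.zeros_like(w)
for _ in range(1000):
    w, sigma2, v = adabayes_like_step(w, curv * w, sigma2, v)
```

In this sketch the transition between regimes falls out of the variance recursion alone: no explicit switch between optimizers is needed, which mirrors the adaptive transition the abstract attributes to AdaBayes.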