Despite the popularity of adaptive optimizers such as Adagrad, RMSProp, Adam, and AdamW in deep learning and machine learning more broadly, their theoretical properties are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Building on this framework, we prove guarantees on these properties that depend heavily on a single parameter, $\beta_2$. Our experiments support our claims and provide practical insights into the stability and generalization behavior of adaptive optimization methods.
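For context, $\beta_2$ is the hyperparameter governing the exponential moving average of squared gradients in Adam-style methods. The standard Adam recursion below (in the usual notation of Kingma and Ba, with gradient $g_t$, step size $\eta$, and small constant $\epsilon$; not necessarily the notation used later in the paper) shows where $\beta_2$ enters:
\begin{align*}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
\theta_{t+1} &= \theta_t - \eta\, \frac{m_t/(1-\beta_1^{t})}{\sqrt{v_t/(1-\beta_2^{t})} + \epsilon}.
\end{align*}
A larger $\beta_2$ averages the squared gradients over a longer horizon, and it is the effect of this choice on stability and generalization that the analysis targets.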