Accelerated training algorithms, such as adaptive learning rates (or preconditioning) and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard adaptive-learning-rate optimizers may not perform effectively. This motivates alternative approaches such as AdamW and raises the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how AdaGrad, RMSProp, and Adam accelerate training by improving Hessian conditioning; (2) We explore the interaction between $L_2$-regularization and preconditioning, showing that AdamW amounts to selecting the underlying intrinsic parameters for regularization, and we derive a generalization to $L_1$-regularization; and (3) We demonstrate how various normalization methods, including input data normalization, batch normalization, and layer normalization, accelerate training by improving Hessian conditioning. Our analysis offers a unified mathematical framework for understanding various acceleration techniques and for deriving appropriate regularization schemes.
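For concreteness, here is a minimal sketch of the two update rules contrasted in point (2), written from the standard formulations of Adam and AdamW rather than taken from the paper's own notation (we assume $\theta_t$ denotes the parameters, $\eta$ the learning rate, $\lambda$ the regularization strength, and $\hat m_t$, $\hat v_t$ the bias-corrected first and second moment estimates of the gradient $g_t$). Adam with $L_2$-regularization folds the penalty gradient into the preconditioned step, whereas AdamW applies the decay outside the preconditioner:
\begin{align*}
\text{Adam} + L_2:&\quad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}, \qquad g_t = \nabla_\theta \mathcal{L}(\theta_t) + \lambda\,\theta_t,\\
\text{AdamW}:&\quad \theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right).
\end{align*}
The difference is thus whether the regularization term is rescaled by the diagonal preconditioner $1/(\sqrt{\hat v_t} + \epsilon)$, which is the interaction analyzed in point (2).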