Finding lower and better-generalizing minima is crucial in deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometry of the loss landscape, it is difficult to guarantee that such a point is the lowest available or generalizes best. To address this, we propose an adaptor, "E", for gradient-based optimizers. An adapted optimizer continues exploring along landscape valleys (regions of low and nearly identical loss) even after reaching a local minimum, searching for potentially better local minima. This increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. For completeness, we also prove convergence of the adapted optimizers in both convex and non-convex settings. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our results show that the adapted Lamb, ALTO, improves the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
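To make the valley-exploration idea concrete, the sketch below is a hypothetical illustration only, not the paper's adaptor "E": a wrapper (here called `ExplorationAdaptor`, an assumed name) around any PyTorch optimizer that, once the loss stops improving, nudges parameters along directions where the gradient is small, i.e. roughly along the low-loss valley, before taking the base step. All thresholds and the perturbation rule are assumptions for illustration.

```python
# Hypothetical sketch only: NOT the paper's adaptor "E". It illustrates the
# general idea of continuing to explore along flat, low-loss directions
# after the loss plateaus, under assumed implementation details.
import torch


class ExplorationAdaptor:
    """Wraps a torch optimizer; after the loss plateaus, it perturbs the
    parameters mostly where the gradient is small (flat valley directions)
    before each base optimizer step."""

    def __init__(self, base_optimizer, patience=20, rel_tol=1e-3, kick=1e-2):
        self.base = base_optimizer      # any torch.optim optimizer
        self.patience = patience        # stagnant steps before exploring
        self.rel_tol = rel_tol          # relative improvement counted as progress
        self.kick = kick                # scale of the exploratory perturbation
        self.best_loss = float("inf")
        self.stale_steps = 0

    def zero_grad(self):
        self.base.zero_grad()

    @torch.no_grad()
    def step(self, loss_value):
        # Track whether the loss is still improving meaningfully.
        if loss_value < self.best_loss * (1.0 - self.rel_tol):
            self.best_loss = loss_value
            self.stale_steps = 0
        else:
            self.stale_steps += 1

        # Once stalled, perturb mostly along directions with small gradient,
        # so the move stays roughly within the low-loss valley.
        if self.stale_steps >= self.patience:
            for group in self.base.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    flat_mask = 1.0 / (1.0 + p.grad.abs())  # ~1 where gradient ~0
                    p.add_(self.kick * flat_mask * torch.randn_like(p))
            self.stale_steps = 0

        self.base.step()


# Minimal usage on a toy loss f(x, y) = x**2, whose y-direction is a flat valley.
params = [torch.tensor([3.0, 0.0], requires_grad=True)]
opt = ExplorationAdaptor(torch.optim.SGD(params, lr=0.1))
for _ in range(200):
    opt.zero_grad()
    loss = params[0][0] ** 2
    loss.backward()
    opt.step(loss.item())
```

In this toy setup the base SGD step drives the x-coordinate toward zero, while the plateau-triggered perturbations drift the iterate along the flat y-direction, mimicking how an exploration adaptor could keep moving through a valley of nearly identical losses.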