Since the beginning of the 21st century, artificial intelligence has been driving a new industrial revolution. Within the training framework, the optimization algorithm aims to steer high-dimensional optimization stably toward local or even global minima. In the era of large language models, although the scale of model parameters and training data has grown substantially, Adam remains the mainstream optimizer. However, compared with optimizers based on stochastic gradient descent (SGD), Adam is more likely to converge to non-flat minima. To address this issue, we propose the AdamX algorithm. Its core innovation is a novel exponential decay rate for the second-order moment estimate, which gradually weakens the correction applied to the learning step as training progresses and degenerates to SGD during the stable phase of training, thereby improving training stability in that phase and potentially enhancing generalization. Experimental results show that the proposed exponential decay rate for the second-order moment estimate outperforms the one currently in use, and that AdamX consistently outperforms Adam and its variants. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.
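To make the idea concrete, the sketch below shows an Adam-style update in which the adaptive (second-moment) correction is annealed away over training, so the step reduces to SGD with momentum in the stable phase. The abstract does not specify the actual decay-rate schedule, so the blending factor alpha, the hyperparameter anneal_steps, and the function adamx_like_step are illustrative assumptions rather than the AdamX algorithm itself; see the linked repository for the real implementation.

    import numpy as np

    def adamx_like_step(param, grad, state, t, lr=1e-3, beta1=0.9,
                        beta2=0.999, anneal_steps=10_000, eps=1e-8):
        # Hypothetical sketch: an Adam-style step whose adaptive correction
        # is annealed toward plain SGD with momentum as training stabilizes.
        # This is NOT the AdamX update; the paper's decay-rate schedule is
        # not given in the abstract.
        m = state.get("m", np.zeros_like(param))
        v = state.get("v", np.zeros_like(param))

        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)

        # Assumed annealing factor: 1.0 early (full adaptive correction),
        # falling to 0.0 so the denominator fades to 1 and the update
        # degenerates to SGD with momentum in the stable training phase.
        alpha = max(0.0, 1.0 - t / anneal_steps)
        denom = alpha * np.sqrt(v_hat) + (1.0 - alpha) + eps

        param = param - lr * m_hat / denom
        state["m"], state["v"] = m, v
        return param, state

Calling adamx_like_step repeatedly with t = 1, 2, ... applies full Adam-style scaling early on and plain momentum SGD once t exceeds anneal_steps; this only illustrates the "fade out the adaptive correction" idea described in the abstract.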

