Since the beginning of the 21st century, artificial intelligence has been driving a new round of industrial revolution. Within the training framework, the optimization algorithm aims to stably converge a high-dimensional optimization problem to a local or even global minimum. In the era of large language models, although the scale of model parameters and data has grown substantially, Adam remains the mainstream optimizer. However, compared with optimizers based on stochastic gradient descent (SGD), Adam is more likely to converge to non-flat minima. To address this issue, we propose the AdamNX algorithm. Its core innovation is a novel exponential decay rate for the second-moment estimate, which gradually weakens the step-size correction as training progresses and degenerates to momentum SGD in the stable phase of training, thereby improving training stability in that phase and potentially enhancing generalization. Experimental results show that the proposed decay rate outperforms the existing decay rate for second-moment estimation, and that AdamNX consistently outperforms Adam and its variants. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.
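To make the idea concrete, below is a minimal Python sketch of how an Adam-style update can fade into momentum SGD as training progresses. The annealing factor `rho(t)`, the linear schedule, and the function `adamnx_like_step` are illustrative assumptions, not the actual AdamNX decay-rate schedule, which is defined in the paper and the open-source repository.

```python
import math

def adamnx_like_step(param, grad, state, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, total_steps=10000):
    """One update on a scalar parameter; `state` holds the moment estimates.

    Illustrative only: `rho` is a hypothetical stand-in for the paper's
    second-moment decay-rate schedule.
    """
    m = state.get("m", 0.0)
    v = state.get("v", 0.0)

    # First- and second-moment estimates, as in Adam.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Hypothetical annealing factor: 1 early in training (full adaptive
    # correction), approaching 0 late in training (no correction).
    rho = max(0.0, 1.0 - t / total_steps)

    # Blend the adaptive denominator toward 1; when rho == 0 the update
    # reduces to momentum SGD (step = lr * m_hat).
    denom = rho * (math.sqrt(v_hat) + eps) + (1.0 - rho)
    param -= lr * m_hat / denom

    state["m"], state["v"] = m, v
    return param

# Toy usage: minimize f(x) = x^2 from x = 5.
x, state = 5.0, {}
for t in range(1, 2001):
    x = adamnx_like_step(x, 2 * x, state, t, lr=0.05, total_steps=2000)
print(round(x, 4))  # approaches 0; late steps behave like momentum SGD
```

The sketch only illustrates the qualitative behavior described in the abstract: the adaptive step-size correction is strong early on and vanishes in the stable training period.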


