Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this work, we study SGD convergence under a gradient-shrinkage model in which each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence whenever \( q_{\min} < 1 \). Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower rate governed by \( q_{\min} \) and with a higher steady-state error floor due to quantization. In this way, we give a theoretical account of how reduced numerical precision slows training, by treating it as gradient shrinkage within the standard SGD convergence framework.
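As a concrete reading of the shrinkage model described above (a minimal sketch; the symbols \( x_k \) for the iterate and \( g_k \) for the stochastic gradient are notation assumed here, not fixed by the abstract), the low-precision update and its effective stepsize can be written as
\[
  x_{k+1} \;=\; x_k - \mu_k\, q_k\, g_k(x_k),
  \qquad q_{\min} \le q_k \le 1,
\]
\[
  \tilde{\mu}_k \;=\; \mu_k q_k \;\ge\; \mu_k q_{\min},
\]
so any convergence rate expressed in terms of \( \mu_k \) for full-precision SGD degrades to the corresponding rate with \( \mu_k q_{\min} \) in the worst case.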