Stochastic Gradient Descent (SGD) has become a cornerstone of neural network optimization due to its computational efficiency and generalization capabilities. However, the gradient noise introduced by SGD is often assumed to be uncorrelated over time, despite the common practice of epoch-based training, in which data are sampled without replacement. In this work, we challenge this assumption and investigate the effect of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum. Our main contributions are twofold: first, we calculate the exact autocorrelation of the noise during epoch-based training, under the assumption that the noise is independent of small fluctuations in the weight vector, and show that SGD noise is inherently anti-correlated over time. Second, we explore the influence of these anti-correlations on the variance of weight fluctuations. We find that, for directions in which the curvature of the loss exceeds a hyperparameter-dependent crossover value, the conventional prediction of an isotropic stationary weight variance, derived for uncorrelated, curvature-proportional noise, is recovered, and the anti-correlations have a negligible effect. For flatter directions, however, the weight variance is significantly reduced, leading to a considerable decrease in loss fluctuations compared with the constant-weight-variance prediction. Finally, we present a numerical experiment in which training with these anti-correlations improves test performance, suggesting that the noise structure inherent to epoch-based training may help find flatter minima that generalize better.
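As a rough, self-contained illustration of the abstract's central claim, and not code from the paper itself, the sketch below simulates discrete-time SGD with momentum on a toy one-dimensional quadratic loss and compares epoch-based (without-replacement) mini-batch sampling against i.i.d. (with-replacement) sampling in a steep and a flat direction. The function name run_sgd and all hyperparameter values are illustrative choices, not quantities from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): stationary weight
# variance of momentum SGD under epoch-based vs. i.i.d. mini-batch sampling,
# on the toy loss L(w) = (1/N) * sum_i (h/2) * (w - a_i)^2 with minimum w = 0.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(h, eta, mu, n_data, batch, n_steps, epoch_based):
    """Simulate 1-D SGD with heavy-ball momentum on a quadratic toy loss.

    epoch_based=True draws mini-batches without replacement, reshuffling
    once per epoch; epoch_based=False draws i.i.d. batches with replacement.
    """
    a = rng.standard_normal(n_data)
    a -= a.mean()  # center the data: the full-batch gradient is exactly h * w
    w = v = 0.0
    ws = np.empty(n_steps)
    order, pos = rng.permutation(n_data), 0
    for t in range(n_steps):
        if epoch_based:
            if pos + batch > n_data:      # epoch finished: reshuffle
                order, pos = rng.permutation(n_data), 0
            idx = order[pos:pos + batch]
            pos += batch
        else:
            idx = rng.integers(0, n_data, batch)
        g = h * (w - a[idx].mean())       # mini-batch gradient
        v = mu * v - eta * g              # heavy-ball momentum update
        w = w + v
        ws[t] = w
    return ws

# Epoch length K = n_data / batch = 32 steps; a rough crossover curvature is
# h* ~ (1 - mu) / (eta * K) ≈ 0.06 for these (illustrative) hyperparameters.
eta, mu, n_data, batch, n_steps = 0.05, 0.9, 512, 16, 200_000
for h in (2.0, 0.02):                     # steep vs. flat direction
    for epoch_based in (False, True):
        ws = run_sgd(h, eta, mu, n_data, batch, n_steps, epoch_based)
        tag = "epoch-based (anti-corr.)" if epoch_based else "i.i.d. noise"
        print(f"h = {h:4.2f}  {tag:25s}  Var[w] = {ws[n_steps//4:].var():.3e}")
```

Under these assumptions, both sampling schemes should give a similar stationary variance in the steep direction (curvature well above the crossover), while the epoch-based run should show a markedly smaller variance in the flat direction: with without-replacement sampling, the mini-batch noise sums to exactly zero over each epoch, which is precisely the anti-correlation the abstract describes.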