Adaptive gradient methods such as AdaGrad are widely used to optimize neural networks. Yet existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee convergence to a stationary point. We propose an adaptive gradient method and show that, for two-layer over-parameterized neural networks, if the width is sufficiently large (polynomially), then the proposed method converges \emph{to the global minimum} in polynomial time, and convergence is robust, \emph{without the need to fine-tune hyper-parameters such as the step-size schedule and with the level of over-parametrization independent of the training error}. Our analysis indicates in particular that over-parametrization is crucial for harnessing the full potential of adaptive gradient methods in the setting of neural networks.
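For context, the abstract does not spell out the update rule of the proposed method; as a minimal reference sketch, the standard diagonal AdaGrad update (the family the method belongs to) keeps a per-coordinate running sum of squared gradients and scales each coordinate's step size accordingly:
\[
  v_{t,i} = v_{t-1,i} + g_{t,i}^{2}, \qquad
  \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i},
\]
where $g_t = \nabla_{\theta} \ell(\theta_t)$ is the (stochastic or full) gradient, $\eta > 0$ is a base step size, and $\epsilon > 0$ is a small constant for numerical stability; the symbols $v$, $\eta$, and $\epsilon$ are notation introduced here only for illustration and need not match the paper's.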