In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why randomly initialized gradient descent optimization algorithms, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations even though the objective function is non-convex and non-smooth. One of the most promising approaches to resolving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article we provide a further contribution to this area of research by considering overparameterized fully-connected rectified (ReLU) artificial neural networks with biases. Specifically, we show that, for a fixed number of training data, the mean squared error of batch gradient descent optimization applied to such a randomly initialized artificial neural network converges to zero at a linear rate, provided that the width of the artificial neural network is sufficiently large, the learning rate is sufficiently small, and the training input data are pairwise linearly independent.
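For illustration, the following is a minimal sketch of the training procedure considered here: full-batch gradient descent on the mean squared error of a randomly initialized, wide, fully-connected ReLU network with biases. All concrete choices in the sketch (a single hidden layer, fixed random output weights, the width, learning rate, initialization scale, and number of steps) are assumptions made for the example only and are not taken from this article.

```python
# Illustrative sketch only: full-batch gradient descent on the MSE of a
# randomly initialized one-hidden-layer ReLU network with biases.
# Width m, learning rate eta, initialization scale, and the choice to keep
# the output weights fixed are assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 8, 3, 2048          # training samples, input dimension, hidden width (m >> n)
eta = 1e-3                    # small learning rate
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs; pairwise linearly independent a.s.
y = rng.standard_normal(n)

# random initialization of hidden weights and biases
W = rng.standard_normal((m, d)) / np.sqrt(d)
b = 0.1 * rng.standard_normal(m)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # output weights (kept fixed in this sketch)

def forward(X):
    pre = X @ W.T + b                 # pre-activations, shape (n, m)
    act = np.maximum(pre, 0.0)        # ReLU
    return act @ a, pre

for step in range(2001):
    pred, pre = forward(X)
    resid = pred - y
    loss = np.mean(resid ** 2)        # mean squared error over the full batch
    # gradients of the MSE with respect to hidden weights and biases
    grad_out = (2.0 / n) * resid                      # dL/dpred, shape (n,)
    mask = (pre > 0).astype(float)                    # ReLU derivative
    delta = (grad_out[:, None] * a[None, :]) * mask   # shape (n, m)
    grad_W = delta.T @ X                              # shape (m, d)
    grad_b = delta.sum(axis=0)                        # shape (m,)
    W -= eta * grad_W
    b -= eta * grad_b
    if step % 500 == 0:
        print(f"step {step:4d}  mse {loss:.3e}")
```

In this regime one would expect the printed mean squared error to decay roughly geometrically in the number of steps, which is the kind of linear convergence behavior the article establishes under its precise assumptions.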