We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while requiring only stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and suited to large-scale optimization problems such as the training of deep neural networks. The algorithm combines gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization, in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. We also show how standard mini-batch optimization methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to the state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.
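To make the combination of inertia with a Newton-like correction concrete, here is a minimal illustrative sketch of a stochastic discretization of an inertial Newton dynamical system, applied to a toy quadratic loss. The hyperparameters `alpha` and `beta`, the initialization, the noise model, and the exact discretization below are assumptions for illustration only, not the paper's verbatim scheme; the phase-space form avoids evaluating the Hessian, so only (noisy) gradients are needed.

```python
import numpy as np

# Phase-space sketch of inertial Newton dynamics (illustrative assumptions):
#   theta'' + alpha*theta' + beta*Hess f(theta) theta' + grad f(theta) = 0
# rewritten with an auxiliary variable psi so no Hessian is ever computed:
#   theta' = -(alpha - 1/beta)*theta - (1/beta)*psi - beta*grad f(theta)
#   psi'   = -(alpha - 1/beta)*theta - (1/beta)*psi
def inna_sketch(grad, theta0, alpha=2.0, beta=1.0, n_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    psi = np.zeros_like(theta)
    for k in range(n_iters):
        gamma = 0.05 / np.log(k + 3.0)  # slowly decaying step size (assumption)
        # stochastic gradient: true gradient plus small Gaussian noise
        v = grad(theta) + 0.01 * rng.standard_normal(theta.shape)
        common = -(alpha - 1.0 / beta) * theta - (1.0 / beta) * psi
        theta = theta + gamma * (common - beta * v)  # explicit Euler step
        psi = psi + gamma * common
    return theta

# Toy problem: f(theta) = 0.5*||theta||^2, whose unique critical point is 0.
theta_final = inna_sketch(lambda t: t, theta0=np.array([1.0, -2.0]))
print(np.linalg.norm(theta_final))
```

At any equilibrium of the dynamics the auxiliary equation forces the shared drift term to vanish, which in turn forces the gradient term to vanish, so fixed points are critical points of the loss; on this toy quadratic the iterates drift to a small neighborhood of the origin whose size is set by the gradient noise.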