We introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parametrized models. Our method is simple and efficient to implement and does not require adapting hyper-parameters or computing full gradients in the course of optimization. Experimental evaluation of MaSS on several standard deep network architectures, including ResNet and convolutional networks, shows improved performance over Adam and SGD in both optimization and generalization. We prove accelerated convergence of MaSS over SGD and provide an analysis of hyper-parameter selection in the quadratic case, as well as some results in the general strongly convex setting. In contrast, we show theoretically and verify empirically that standard SGD+Nesterov can diverge for common choices of hyper-parameter values. We also analyze the practically important question of how the convergence rate and optimal hyper-parameters depend on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns, and saturation.
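For concreteness, the following is a minimal sketch of what a momentum-added stochastic update of this kind can look like on a toy interpolating least-squares problem. The exact MaSS update rule and its hyper-parameter settings are specified in the paper body, not in this abstract; the two step sizes and the momentum term (named `eta1`, `eta2`, `gamma` here), their values, and the sign of the compensation term are all illustrative assumptions.

```python
import numpy as np

# Sketch of a momentum-added SGD update (assumed form, not the paper's
# definitive algorithm) on a realizable least-squares problem, where the
# model can fit the data exactly, as in the over-parametrized setting.

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
b = A @ rng.normal(size=20)          # realizable target: zero loss is attainable

def stochastic_grad(w, batch_size=10):
    """Mini-batch gradient of the least-squares loss 0.5*||A w - b||^2 / n."""
    idx = rng.integers(0, A.shape[0], size=batch_size)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ w - bb) / batch_size

w = np.zeros(20)                      # primary iterate
u = w.copy()                          # auxiliary (momentum) iterate
eta1, eta2, gamma = 1e-2, 1e-3, 0.9   # assumed hyper-parameter values

for _ in range(2000):
    g = stochastic_grad(u)
    w_next = u - eta1 * g             # SGD step taken from the auxiliary point
    # Nesterov-style momentum plus an extra eta2-scaled gradient term
    # (the "added" compensation; its exact form is an assumption here):
    u = w_next + gamma * (w_next - w) + eta2 * g
    w = w_next

print("final loss:", 0.5 * np.mean((A @ w - b) ** 2))
```

Setting `eta2 = 0` in this sketch recovers plain SGD+Nesterov, which is the baseline the abstract contrasts against; the additional term is what distinguishes a MaSS-style update from it.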