Is it possible for a first-order method, i.e., one that uses only first derivatives, to be quadratically convergent? For univariate loss functions the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton's method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While such high convergence orders are a pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive size, as randomization invariably compromises convergence speed. We introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting; they require no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to the randomized Kaczmarz method -- note that this is not true for SGD or SLBFGS -- and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
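For reference, the classical (deterministic) Steffensen iteration for a univariate loss $f$ with derivative $g = f'$ replaces the Newton step $g(x)/g'(x)$ by the divided-difference step $g(x)^2 / \bigl(g(x + g(x)) - g(x)\bigr)$, so only first derivatives of $f$ are ever evaluated. The sketch below illustrates this classical iteration only; function names and tolerances are illustrative and do not reflect the stochastic variants or adaptive learning rates proposed in the paper.

```python
# Minimal sketch of the classical Steffensen iteration for minimizing a
# smooth univariate loss f, i.e. finding a root of g = f'.  The divided
# difference (g(x + g(x)) - g(x)) / g(x) stands in for the second
# derivative f'' used by Newton's method, yet the iteration remains
# quadratically convergent near a nondegenerate minimizer.

def steffensen_minimize(g, x0, tol=1e-12, max_iter=50):
    """Minimize a univariate loss whose first derivative is g (illustrative names)."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        denom = g(x + gx) - gx          # divided-difference surrogate, approx. g'(x) * g(x)
        if denom == 0.0:
            break                       # avoid division by zero near stationarity
        x = x - gx * gx / denom         # Steffensen update: x - g(x)^2 / (g(x + g(x)) - g(x))
    return x

# Example: f(x) = (x - 2)^2 / 2, so g(x) = x - 2; the minimizer x = 2
# is recovered in a single step since g is linear.
if __name__ == "__main__":
    print(steffensen_minimize(lambda x: x - 2.0, x0=0.0))
```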