Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning, and it is often used with adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning that can be viewed as the communication-adaptive counterpart of the celebrated Adam method, justifying its name CADA. The key components of CADA are a set of new rules, tailored to adaptive stochastic gradients, that can be implemented to save communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, while retaining convergence rates comparable to those of the original Adam. In numerical experiments, CADA delivers a substantial reduction in the total number of communication rounds.
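To make the communication-saving idea concrete, below is a minimal sketch of a CADA-style training loop: each worker uploads its fresh stochastic gradient only when it differs sufficiently from the last uploaded copy, and the server runs a standard Adam update on the aggregate of (possibly stale) gradients. The skip rule shown here, a fixed threshold `skip_tol` on the gradient change, is a simplified stand-in for the paper's actual adaptive rules, and all function and variable names are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of a communication-adaptive Adam ("CADA"-style) loop.
# NOTE: the skip rule (fixed threshold on the gradient change) is a
# simplified placeholder for the paper's adaptive rules; names are illustrative.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update on the server, given the aggregated gradient g."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def train(grad_fn, w, num_workers=4, steps=100, skip_tol=1e-3):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    stale = [np.zeros_like(w) for _ in range(num_workers)]  # last uploaded gradients
    uploads = 0
    for t in range(1, steps + 1):
        for i in range(num_workers):
            g_new = grad_fn(i, w)  # worker i's fresh stochastic gradient
            # Communication-adaptive rule (simplified): upload only if the
            # gradient changed enough since the last upload; otherwise the
            # server reuses the stale copy, saving one upload.
            if np.linalg.norm(g_new - stale[i]) > skip_tol:
                stale[i] = g_new
                uploads += 1
        g_agg = sum(stale) / num_workers  # aggregate possibly stale gradients
        w, m, v = adam_step(w, g_agg, m, v, t)
    print(f"uploads: {uploads} / {steps * num_workers} possible")
    return w

# Toy quadratic objective split across workers (illustrative only).
rng = np.random.default_rng(0)
A = [rng.standard_normal((8, 8)) for _ in range(4)]
grad_fn = lambda i, w: A[i].T @ (A[i] @ w)
w_final = train(grad_fn, np.ones(8))
```

In this toy run, late in training the gradients change slowly, so many per-worker uploads are skipped while the server still performs a full Adam step every round, which is the intuition behind the communication-round savings reported in the experiments.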