Gradient Descent, Stochastic Optimization, and Other Tales

The goal of this paper is to debunk and dispel the magic behind black-box optimizers and stochastic optimizers. It aims to build a solid foundation on how and why the techniques work. This manuscript crystallizes this knowledge by deriving the mathematics behind the strategies from simple intuitions. This tutorial does not shy away from addressing both the formal and informal aspects of gradient descent and stochastic optimization methods. By doing so, it hopes to provide readers with a deeper understanding of these techniques as well as the when, the how, and the why of applying these algorithms. Gradient descent is one of the most popular optimization algorithms and by far the most common way to optimize machine learning tasks. Its stochastic version has received attention in recent years, particularly for optimizing deep neural networks. In deep neural networks, the gradient computed from a single sample or a batch of samples is used in place of the full gradient to save computational resources and to escape from saddle points. In 1951, Robbins and Monro published "A stochastic approximation method", one of the first modern treatments of stochastic optimization, which estimates local gradients with a new batch of samples. Stochastic optimization has since become a core technology in machine learning, largely due to the development of the backpropagation algorithm for fitting neural networks. The aim of this article is to give a self-contained introduction to the concepts and mathematical tools of gradient descent and stochastic optimization.
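To make the idea of following the gradient of a batch of samples concrete, here is a minimal sketch of mini-batch stochastic gradient descent on a one-dimensional least-squares problem. It is an illustration only, not the paper's own implementation: the function name `minibatch_sgd`, the toy data, and the hyperparameters (`lr`, `batch_size`, `epochs`) are all assumptions chosen for this example.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=300, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the mean squared error.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw fresh random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only,
            # used as a cheap estimate of the full-data gradient.
            residuals = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
            grad_w = sum(2 * r * x for r, x in residuals) / len(batch)
            grad_b = sum(2 * r for r, _ in residuals) / len(batch)
            # Step against the estimated gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (an assumed example).
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each update touches only `batch_size` samples rather than the whole data set, which is where the computational saving comes from. Setting `batch_size=len(xs)` would recover ordinary (full-batch) gradient descent, and `batch_size=1` the single-sample variant.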
