Dynamic Control Flow in Large-Scale Machine Learning

Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, Xiaoqiang Zheng

From arXiv. Appeared in EuroSys 2018. 14 pages, 16 figures.

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
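To make the second point concrete, here is a minimal sketch in plain Python of what differentiating through data-dependent control flow involves. It is not TensorFlow code and not the authors' implementation. The function names `forward` and `backward`, the loop body, and the thresholds are all hypothetical, chosen only for illustration. The forward pass records, for each iteration, which branch ran and the value it consumed. The backward pass replays those records in reverse and multiplies the local derivatives together.

```python
def forward(x, limit=100.0):
    """Run a loop whose trip count and branches depend on the data.

    Assumes x > 1 so that the loop terminates (illustrative choice).
    Returns the result and a record of every iteration for the backward pass.
    """
    tape = []  # one (branch_taken, input_value) entry per iteration
    y = x
    while y < limit:                 # trip count depends on the value of y
        squared = y < 10.0           # data-dependent conditional in the body
        tape.append((squared, y))
        y = y * y if squared else 3.0 * y
    return y, tape


def backward(tape):
    """Return dy/dx by walking the recorded iterations in reverse order."""
    grad = 1.0
    for squared, y_in in reversed(tape):
        # Local derivative of whichever branch actually executed:
        # d(y*y)/dy = 2*y, d(3*y)/dy = 3.
        grad *= 2.0 * y_in if squared else 3.0
    return grad


y, tape = forward(2.0)
dy_dx = backward(tape)
print(y, dy_dx)
```

For the input 2.0 the loop squares twice and then triples twice, so the result equals 9x^4 = 144 and the gradient equals 36x^3 = 288. The sketch shows only the bookkeeping: the gradient computation has to know how many iterations ran and which branch each one took, neither of which is known before execution.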

