Baechi: Fast Device Placement of Machine Learning Graphs

Machine learning graphs (or models) can be challenging or impossible to train when devices have limited memory or when models are large. To split a model across devices, learning-based approaches are still popular. While these produce model placements that train fast on data (i.e., low step times), learning-based model parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654× to 206,000× faster than state-of-the-art learning-based approaches, and (ii) the step (training) time of Baechi-placed models is comparable to that of expert placements in PyTorch, and at most 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that, compared to learning-based approaches, algorithmic approaches can face different challenges in adapting to machine learning systems, but they also offer proven bounds and significant performance benefits.
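To make the placement problem concrete, here is a minimal sketch of a naive memory-constrained placer. It is not one of Baechi's algorithms, which the abstract does not specify. It only illustrates the setting: operators form a dependency graph, each operator needs some memory, and each device has a fixed capacity. The names `topo_order`, `place`, `deps`, `mem`, and `capacity` are hypothetical, and the strategy (fill one device at a time in topological order) is an assumption chosen for simplicity. It ignores compute and communication costs entirely.

```python
from collections import deque


def topo_order(deps):
    """Return operators in a topological order (Kahn's algorithm).

    deps maps each operator name to the list of operators it depends on.
    """
    indegree = {op: len(parents) for op, parents in deps.items()}
    children = {op: [] for op in deps}
    for op, parents in deps.items():
        for parent in parents:
            children[parent].append(op)

    ready = deque(op for op, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for child in children[op]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("operator graph contains a cycle")
    return order


def place(deps, mem, capacity):
    """Assign each operator to a device index without exceeding device memory.

    Devices are filled one at a time, in topological order of the operators.
    Once an operator does not fit, the placer moves on to the next device
    and never returns to an earlier one.
    """
    placement = {}
    device, used = 0, 0
    for op in topo_order(deps):
        # Skip ahead until the operator fits on the current device.
        while device < len(capacity) and used + mem[op] > capacity[device]:
            device, used = device + 1, 0
        if device == len(capacity):
            raise MemoryError(f"operator {op!r} does not fit on any remaining device")
        placement[op] = device
        used += mem[op]
    return placement


if __name__ == "__main__":
    # Hypothetical four-operator graph: b and c depend on a, d depends on b and c.
    deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    mem = {"a": 2, "b": 3, "c": 2, "d": 2}  # memory units per operator
    capacity = [5, 5]                       # memory units per device
    print(place(deps, mem, capacity))
```

A placer like this runs in time linear in the size of the graph, which hints at why an algorithmic approach can produce a plan far faster than a learned one. Getting a low step time as well requires accounting for computation and communication, which this sketch leaves out.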

