Distributed training frameworks, such as TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating new model designs---they typically come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to obtain the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, achieving a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.