Distributed training frameworks, such as TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating new model designs---they typically come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to obtain the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, achieving a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.