This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce the per-GPU memory footprint. Tofu partitions a dataflow graph of fine-grained tensor operators so that it works transparently with a general-purpose deep learning platform such as MXNet. To automatically partition each operator, we propose describing an operator's semantics in a simple language that represents tensors as lambda functions mapping tensor coordinates to values. To optimally partition the different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models, and achieves a 25%–400% speedup over alternative approaches to training very large models.
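To make the coordinate-to-value idea concrete, here is a minimal sketch (with hypothetical names, not Tofu's actual description language) of describing an operator's semantics as a lambda from output tensor coordinates to values, using matrix multiplication as the example: C[i, j] = sum_k A[i, k] * B[k, j].

```python
# A hedged illustration of describing an operator as a lambda from
# output coordinates to values (hypothetical helper, not Tofu's API).
def describe_matmul(A, B):
    K = len(A[0])  # length of the reduction dimension k
    # The operator's semantics: output coordinate (i, j) -> value.
    return lambda i, j: sum(A[i][k] * B[k][j] for k in range(K))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = describe_matmul(A, B)
print(C(0, 0))  # 1*5 + 2*7 = 19
```

Such a description makes explicit which input elements each output element reads, which is exactly the information a partitioner needs to decide how an operator can be split across devices (e.g., partitioning C by rows only requires the corresponding rows of A, while every device needs all of B).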