As giant dense models advance quality but require large-scale, expensive GPU clusters for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional computation architecture, has been proposed to scale models while keeping computation cost constant. Specifically, the input data is routed by a gate network and activates only a subset of the expert networks. Existing MoE training systems support only some of the mainstream MoE models (e.g., top-k gating) and rely on expensive high-bandwidth GPU clusters. In this paper, we present HetuMoE, a high-performance large-scale sparse MoE training system built on Hetu. HetuMoE provides multiple gating strategies and efficient GPU kernel implementations. To further improve training efficiency on commodity GPU clusters (e.g., with only one NIC per node), we introduce hierarchical AllToAll communication, which combines hierarchical networking with message aggregation. Compared with existing state-of-the-art MoE systems, HetuMoE achieves at least a 15% speedup. In particular, HetuMoE outperforms DeepSpeed-MoE by up to 8.1x under the switch gate with a batch size of 32. The code is available at: https://github.com/PKU-DAIR/Hetu.
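To make the gating mechanism described above concrete, the following is a minimal PyTorch-style sketch of top-k routing, of which the switch gate is the k=1 special case. It is an illustration only, not HetuMoE's implementation; all names (`topk_gate`, `w_gate`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def topk_gate(x, w_gate, k):
    """Hypothetical sketch: route each token to its k highest-scoring experts."""
    logits = x @ w_gate                       # (tokens, num_experts)
    scores = F.softmax(logits, dim=-1)        # per-token expert probabilities
    topk_scores, topk_idx = scores.topk(k, dim=-1)
    # Renormalize so each token's selected expert weights sum to 1.
    topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_scores, topk_idx              # combine weights, expert ids

# Usage: 8 tokens of dim 16, 4 experts, top-2 routing (k=1 gives the switch gate).
x = torch.randn(8, 16)
w_gate = torch.randn(16, 4)
weights, experts = topk_gate(x, w_gate, k=2)
```

Only the experts indexed by `experts` process each token, which is why computation stays constant as the total number of experts grows.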
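The hierarchical AllToAll idea can likewise be sketched. The abstract does not specify HetuMoE's buffer layout, so the code below only shows the assumed two-phase structure, intra-node aggregation followed by a single inter-node exchange, using standard `torch.distributed` collectives; it presumes an initialized process group, equal splits, and pre-permuted buffers.

```python
import torch
import torch.distributed as dist

def hierarchical_all_to_all(send_buf, intra_group, inter_group):
    """Hypothetical two-phase AllToAll for nodes with a single NIC.

    Phase 1 (intra-node): GPUs on the same node exchange over NVLink/PCIe,
    aggregating messages bound for the same remote node onto one local GPU.
    Phase 2 (inter-node): one AllToAll between nodes pushes fewer, larger
    messages through the single NIC, using its bandwidth more efficiently
    than many small per-GPU transfers.
    """
    staged = torch.empty_like(send_buf)
    dist.all_to_all_single(staged, send_buf, group=intra_group)  # phase 1
    recv = torch.empty_like(staged)
    dist.all_to_all_single(recv, staged, group=inter_group)      # phase 2
    return recv
```

The design choice here is bandwidth-driven: with only one NIC per node, replacing many small cross-node messages with a few aggregated ones is what recovers throughput on commodity clusters.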