Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpora and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow "Transformer" model across multiple nodes have hit roadblocks due to excessive memory use and the resulting out-of-memory errors when performing MPI collectives. This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% weak scaling efficiency up to 1200 MPI processes (300 nodes) and 65% strong scaling efficiency up to 400 MPI processes (200 nodes) on the Stampede2 supercomputer.
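For context, the densification the abstract refers to amounts to converting TensorFlow's tf.IndexedSlices gradients (produced, for example, by embedding lookups) into ordinary dense tensors before the collective, so Horovod can perform a fixed-size dense allreduce instead of an allgather of variable-size sparse updates. Below is a minimal sketch of that idea using the TF1-era API current when this work was done; the toy embedding table, lookup indices, and loss are illustrative stand-ins, not the paper's Transformer graph:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy embedding lookup: its gradient w.r.t. the table is a
# tf.IndexedSlices (sparse) object -- the "assumed-sparse" case
# the abstract refers to.
table = tf.get_variable("embedding", shape=[1000, 64])
ids = tf.constant([1, 5, 7])
loss = tf.reduce_sum(tf.nn.embedding_lookup(table, ids))

grads = tf.gradients(loss, [table])

# Densify before the collective: tf.convert_to_tensor turns an
# IndexedSlices into an ordinary dense tensor, so hvd.allreduce
# performs a dense reduction rather than gathering variable-size
# sparse updates from every rank.
dense_grads = [tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g
               for g in grads]
averaged = [hvd.allreduce(g) for g in dense_grads]
```

Horovod exposes this same conversion as the sparse_as_dense=True argument to hvd.DistributedOptimizer, which applies it to every gradient automatically.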