Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communication used by these applications becomes a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and thus help these applications scale. This paper introduces MSCCL, a system designed to make GPU communication programmable. MSCCL provides a data-oriented domain-specific language for writing custom collective communication algorithms and an optimizing compiler that lowers them to an executable form, which runs efficiently and flexibly in an interpreter-based runtime. We used MSCCL to write novel collective implementations for AllReduce and AllToAll that are up to 48% and 20% faster than optimized vendor implementations, respectively. We also demonstrate how directly implementing an application-specific collective called AllToNext in MSCCL results in a 14.5× speedup over the baseline.
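To make the data-oriented, chunk-level style of such a DSL concrete, the following is a minimal illustrative sketch of how a ring AllReduce schedule can be described as explicit per-chunk copy and reduce steps. The `Step` class and `ring_allreduce_schedule` function are hypothetical names for illustration only; they are not the MSCCL language or API.

```python
# Illustrative sketch only: a chunk-oriented description of a ring AllReduce
# schedule, in the spirit of a data-oriented collective DSL. These names are
# hypothetical and do NOT correspond to the actual MSCCL syntax.
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    src: int      # rank that sends
    dst: int      # rank that receives
    chunk: int    # chunk index being communicated
    reduce: bool  # True: receiver reduces into its buffer; False: plain copy


def ring_allreduce_schedule(num_ranks: int) -> list[Step]:
    """Build the communication schedule for a ring AllReduce.

    Each rank's input buffer is split into `num_ranks` chunks.
    Phase 1 (reduce-scatter): after num_ranks - 1 steps, rank r holds the
    fully reduced chunk (r + 1) % num_ranks.
    Phase 2 (all-gather): the reduced chunks are circulated so that every
    rank ends up with all of them.
    """
    steps: list[Step] = []
    # Phase 1: reduce-scatter around the ring.
    for s in range(num_ranks - 1):
        for r in range(num_ranks):
            chunk = (r - s) % num_ranks
            steps.append(Step(src=r, dst=(r + 1) % num_ranks,
                              chunk=chunk, reduce=True))
    # Phase 2: all-gather around the ring.
    for s in range(num_ranks - 1):
        for r in range(num_ranks):
            chunk = (r + 1 - s) % num_ranks
            steps.append(Step(src=r, dst=(r + 1) % num_ranks,
                              chunk=chunk, reduce=False))
    return steps


if __name__ == "__main__":
    for step in ring_allreduce_schedule(4):
        op = "reduce" if step.reduce else "copy"
        print(f"rank {step.src} -> rank {step.dst}: chunk {step.chunk} ({op})")
```

In this style, the algorithm is expressed purely as data movement over named chunks; a compiler and runtime such as those described above would then be responsible for mapping these steps onto channels, threadblocks, and the underlying interconnect.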