TurboTransformers: An Efficient GPU Serving System For Transformer Models

The transformer is the most critical algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, deploying them efficiently for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take sentences of variable length as input, and this variability in input dimensions poses a severe problem for efficient memory management and serving optimization. This paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework, to address these challenges. Three features distinguish it from similar works. First, an efficient parallel algorithm is proposed for GPU-based batch reduction operations such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. Second, a memory allocation algorithm that better balances memory footprint against allocation/free efficiency is designed for variable-length inputs. Third, a serving framework equipped with a new batch scheduler based on dynamic programming achieves the optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be integrated into PyTorch code with a few lines of code.
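To make the idea of a dynamic-programming batch scheduler for variable-length requests more concrete, here is a minimal Python sketch. It is not the TurboTransformers implementation. It assumes that every request in a batch is padded to the longest request in that batch, that requests are sorted by length and split into contiguous groups, and that the cost of a batch is given by a caller-supplied function. The names `plan_batches`, `toy_cost`, and `max_batch_size` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def plan_batches(
    lengths: List[int],
    batch_cost: Callable[[int, int], float],
    max_batch_size: int,
) -> Tuple[float, List[List[int]]]:
    """Split requests into batches that minimize the summed batch cost.

    Assumption: a batch is padded to its longest request, so its cost
    depends only on (padded length, batch size).
    """
    ordered = sorted(lengths)
    n = len(ordered)
    # best[i] = minimal total cost of serving the i shortest requests.
    best = [0.0] + [float("inf")] * n
    # split[i] = start index of the last batch in the best plan for i requests.
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        # The last batch covers ordered[j:i]; its padded length is ordered[i - 1].
        for j in range(max(0, i - max_batch_size), i):
            cost = best[j] + batch_cost(ordered[i - 1], i - j)
            if cost < best[i]:
                best[i] = cost
                split[i] = j
    # Walk the split points backwards to recover the batches.
    batches: List[List[int]] = []
    i = n
    while i > 0:
        j = split[i]
        batches.append(ordered[j:i])
        i = j
    batches.reverse()
    return best[n], batches


def toy_cost(padded_len: int, batch_size: int) -> float:
    # Hypothetical cost model: fixed per-batch overhead plus padded token count.
    return 50.0 + padded_len * batch_size
```

With this toy cost model, `plan_batches([5, 6, 100, 7], toy_cost, max_batch_size=4)` groups the three short requests together and serves the long one alone, because padding the short requests to length 100 would cost more than one extra batch launch. In a real system the cost function would come from measured latencies rather than a formula.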
