Primer: Searching for Efficient Transformers for Language Modeling

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
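
The two modifications named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow implementation: the function names, the kernel width of 3, the causal left-padding, and the toy values are all assumptions made here for illustration. The sketch shows a squared ReLU (the ReLU output multiplied by itself) and a depthwise convolution along the sequence axis, where each channel of a projected Q, K, or V tensor is filtered with its own small kernel.

```python
def squared_relu(x):
    """Squared ReLU: max(x, 0) ** 2, applied elementwise to a list of floats."""
    return [max(v, 0.0) ** 2 for v in x]


def causal_depthwise_conv(seq, kernels):
    """Depthwise 1-D convolution along the sequence axis (hypothetical helper).

    seq:     list of T positions, each a list of C channel values.
    kernels: list of C kernels, each a list of K weights; kernels[c][-1]
             multiplies the current position, kernels[c][0] the oldest one.
    Each channel is filtered only with its own kernel (depthwise). Positions
    before the start of the sequence count as zero, so output t never
    depends on inputs after t (an assumption suited to auto-regressive use).
    """
    num_positions = len(seq)
    num_channels = len(kernels)
    width = len(kernels[0])
    out = []
    for t in range(num_positions):
        row = []
        for c in range(num_channels):
            acc = 0.0
            for k in range(width):
                src = t - (width - 1) + k  # input position read by weight k
                if src >= 0:  # skip the implicit zero padding on the left
                    acc += kernels[c][k] * seq[src][c]
            row.append(acc)
        out.append(row)
    return out


# Toy "projected queries": 4 positions, 2 channels (illustrative values).
q = [[1.0, -1.0], [2.0, 0.5], [3.0, 2.0], [4.0, -3.0]]
# One width-3 kernel per channel (illustrative values, not learned weights).
kernels = [
    [0.0, 0.0, 1.0],  # identity: passes channel 0 through unchanged
    [1.0, 1.0, 1.0],  # running sum over the current and two previous positions
]
q_conv = causal_depthwise_conv(q, kernels)
activated = squared_relu([-2.0, 0.0, 3.0])
```

In a real model the same convolution would be applied separately to the Q, K, and V projections, with learned kernels, before attention is computed; the nested loops here favor readability over speed.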
