Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have used full attention -- which operates on all pairs of sequence elements -- the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model whose memory and computation requirements scale only linearly with sequence length. We show that, despite the simpler attention model, networks using this mechanism can achieve performance comparable to full-attention networks on language modeling tasks.
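To make the scaling contrast concrete, the sketch below compares standard softmax attention, which materializes an n x n score matrix, with a kernelized variant that avoids it. The feature map phi (here a simple positive ReLU-based map) is an illustrative assumption, not necessarily the mechanism used in this work; the point is only that reordering the matrix products yields cost linear in the sequence length n.

```python
import numpy as np

def full_attention(Q, K, V):
    # Standard softmax attention: forms an (n x n) score matrix,
    # so memory and compute grow quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: with a positive feature map phi, the output
    # can be written as phi(Q) @ (phi(K).T @ V), normalized by
    # phi(Q) @ sum_j phi(K_j). The (n x n) matrix is never formed,
    # so memory and compute are linear in n.
    Qp, Kp = phi(Q), phi(K)                                   # (n, r)
    kv = Kp.T @ V                                             # (r, d)
    z = Kp.sum(axis=0)                                        # (r,)
    return (Qp @ kv) / (Qp @ z)[:, None]                      # (n, d)

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Both functions return an (n, d) output; the difference is that the kernelized form replaces the quadratic-size intermediate with two small summaries of the keys and values.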