Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have used full attention -- which operates on all pairs of sequence elements -- the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model whose memory and computation requirements scale only linearly with sequence length. We show that, despite the simpler attention model, networks using this mechanism can achieve performance comparable to full-attention networks on language modeling tasks.
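To make the scaling contrast concrete, the sketch below compares standard softmax attention, which materializes an n x n score matrix, with a kernelized variant that avoids it. The feature map phi (here a simple positive ReLU-based map) is an illustrative assumption, not necessarily the mechanism used in this work; the point is only that reordering the matrix products yields cost linear in the sequence length n.

```python
import numpy as np

def full_attention(Q, K, V):
    # Standard softmax attention: forms an (n x n) score matrix,
    # so memory and compute grow quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: with a positive feature map phi, the output
    # can be written as phi(Q) @ (phi(K).T @ V), normalized by
    # phi(Q) @ sum_j phi(K_j). The (n x n) matrix is never formed,
    # so memory and compute are linear in n.
    Qp, Kp = phi(Q), phi(K)                                   # (n, r)
    kv = Kp.T @ V                                             # (r, d)
    z = Kp.sum(axis=0)                                        # (r,)
    return (Qp @ kv) / (Qp @ z)[:, None]                      # (n, d)

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Both functions return an (n, d) output; the difference is that the kernelized form replaces the quadratic-size intermediate with two small summaries of the keys and values.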