The attention mechanism was first proposed in the field of visual imagery, but it truly took off with the Google DeepMind team's paper "Recurrent Models of Visual Attention" [14], which used an attention mechanism on top of an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], applied an attention-like mechanism to machine translation to perform translation and alignment jointly; their work is generally regarded as the first application of the attention mechanism to NLP. Similar attention-based RNN models were then extended to a wide range of NLP tasks. More recently, how to incorporate attention mechanisms into CNNs has also become an active research topic.
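
As a concrete illustration of the mechanism Bahdanau et al. [1] use for joint alignment and translation, here is a minimal NumPy sketch of additive attention over a sequence of encoder states. The function name, parameter names (`W_q`, `W_k`, `v`), and all dimensions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style additive attention (a sketch, not the original code).

    query : (d_q,)      decoder hidden state at the current step
    keys  : (T, d_k)    encoder hidden states for T source positions
    W_q   : (d_a, d_q)  projection of the query
    W_k   : (d_a, d_k)  projection of the keys
    v     : (d_a,)      scoring vector
    """
    # e_t = v^T tanh(W_q q + W_k k_t) for every source position t
    scores = np.tanh(keys @ W_k.T + W_q @ query) @ v   # (T,)
    weights = softmax(scores)                          # alignment weights, sum to 1
    context = weights @ keys                           # (d_k,) weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, d_q, d_k, d_a = 5, 8, 8, 16
ctx, w = additive_attention(rng.normal(size=d_q), rng.normal(size=(T, d_k)),
                            rng.normal(size=(d_a, d_q)), rng.normal(size=(d_a, d_k)),
                            rng.normal(size=d_a))
print(w.round(3), ctx.shape)
```

The alignment weights `w` are exactly the soft word-alignment that the paper learns jointly with translation.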


With the advent of nature-inspired, purely attention-based models, i.e., Transformers, and their success in natural language processing (NLP), their extension to machine vision (MV) tasks was inevitable and widely anticipated. The subsequent introduction of Vision Transformers (ViTs) has challenged existing deep-learning-based machine vision techniques. However, purely attention-based models/architectures such as Transformers require large amounts of data, long training times, and substantial computational resources. Some recent work has shown that combining these two different domains can yield systems that enjoy the advantages of both. Accordingly, this state-of-the-art survey is presented in the hope of helping readers obtain useful information about this interesting and potentially promising research area. It first introduces the attention mechanism and then discusses popular attention-based deep architectures. It then covers the main categories at the intersection of attention mechanisms and deep learning for machine vision, and finally discusses the major algorithms, open issues, and trends within the scope of this work.
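
For readers unfamiliar with the mechanism the survey builds on, the following is a minimal NumPy sketch of scaled dot-product self-attention as used in Transformers and ViTs. All shapes and variable names here are illustrative assumptions, not code from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over n tokens.

    X   : (n, d)    token embeddings (word tokens in NLP, image patches in a ViT)
    W_* : (d, d_h)  learned projections to queries, keys and values
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # (n, d_h) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) pairwise similarities
    A = softmax(scores, axis=-1)                # attention weights, each row sums to 1
    return A @ V                                # (n, d_h) mixed token representations

rng = np.random.default_rng(0)
n, d, d_h = 6, 32, 16
out = self_attention(rng.normal(size=(n, d)),
                     rng.normal(size=(d, d_h)) / np.sqrt(d),
                     rng.normal(size=(d, d_h)) / np.sqrt(d),
                     rng.normal(size=(d, d_h)) / np.sqrt(d))
print(out.shape)   # (6, 16)
```

Note that the (n, n) score matrix is the source of the quadratic complexity discussed in the paper abstract under "Latest Papers" below.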


Latest Papers

The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
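
As a rough illustration of the idea in this abstract (not the authors' implementation), the sketch below pairs a simple kernelized attention, using an `elu(x)+1` feature map as an assumed stand-in for the paper's random features, with an exponentiated RPE bias, and evaluates every product with the resulting Toeplitz matrix via FFT so the n x n attention matrix is never formed. All names and shapes are my own assumptions.

```python
import numpy as np

def toeplitz_matvec(col, row, X):
    """Multiply a Toeplitz matrix C by X with FFT in O(n log n) per column.

    col : (n,)   first column of C, i.e. c_{i-j} for i-j = 0 .. n-1
    row : (n,)   first row of C,    i.e. c_{i-j} for i-j = 0, -1, .. -(n-1)
    X   : (n, d)
    """
    n = col.shape[0]
    # Embed C in a 2n x 2n circulant matrix and apply the convolution theorem.
    c = np.concatenate([col, [0.0], row[1:][::-1]])   # (2n,) first column of the circulant
    fc = np.fft.rfft(c)
    fX = np.fft.rfft(np.concatenate([X, np.zeros_like(X)], axis=0), axis=0)
    return np.fft.irfft(fc[:, None] * fX, n=2 * n, axis=0)[:n]

def kernelized_attention_rpe(Q, K, V, rpe):
    """Kernelized (linear) attention with a relative positional bias (a sketch).

    Q, K, V : (n, d)
    rpe     : (2n - 1,) relative biases b_{i-j} for i-j = -(n-1) .. n-1
    exp(b_{i-j}) forms a Toeplitz matrix, so every product with it is done
    by FFT instead of materializing the n x n attention matrix.
    """
    n = Q.shape[0]
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x)+1: a simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    b = np.exp(rpe)
    col, row = b[n - 1:], b[:n][::-1]                     # Toeplitz C[i, j] = exp(b_{i-j})
    num, den = np.zeros_like(V), np.zeros(n)
    for r in range(Kf.shape[1]):                          # one FFT-Toeplitz product per feature dimension
        KV = Kf[:, r:r + 1] * np.concatenate([V, np.ones((n, 1))], axis=1)
        CKV = toeplitz_matvec(col, row, KV)               # (n, d + 1); the extra column yields the normalizer
        num += Qf[:, r:r + 1] * CKV[:, :-1]
        den += Qf[:, r] * CKV[:, -1]
    return num / den[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
rpe = 0.1 * rng.normal(size=2 * n - 1)
print(kernelized_attention_rpe(Q, K, V, rpe).shape)   # (128, 16)
```

Each `toeplitz_matvec` call costs O(n log n) per column, so the whole computation scales as O(n log n) in sequence length rather than the O(n^2) of standard attention with RPE.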
