The attention mechanism is a central component of Transformer models. It extracts features from embedded vectors by incorporating global information, and its expressivity has been shown to be powerful. Nevertheless, its quadratic complexity limits its practicality. Although several studies have proposed sparse forms of attention, they lack theoretical analysis of how expressivity is affected while complexity is reduced. In this paper, we propose Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for preserving expressivity. Random Batch Attention has several notable strengths: (1) it has linear time complexity and, moreover, can be parallelized along a new dimension, which yields substantial memory savings; (2) it can improve most existing models by replacing their attention mechanisms, including many previously improved attention variants; (3) it has a theoretical convergence explanation, as it derives from the Random Batch Method in computational mathematics. Experiments on large graphs confirm the advantages above. In addition, the theoretical modeling of the self-attention mechanism provides a new tool for future research on attention-mechanism analysis.
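Since the abstract attributes RBA's linear complexity to the Random Batch Method, which reduces pairwise interaction costs by restricting interactions to small random batches, the following is a minimal sketch of how that idea could be applied to single-head attention. It is not the authors' implementation: the function name `random_batch_attention`, the `batch_size` parameter, and the unmasked zero-padding are illustrative assumptions. With a fixed batch size p, the cost scales as O(N·p) in the number of tokens N rather than O(N²).

```python
# Minimal sketch (assumption, not the paper's algorithm): tokens are randomly
# partitioned into batches of size p and softmax attention is computed only
# within each batch, so the cost is O(N * p) instead of O(N^2).
import torch
import torch.nn.functional as F

def random_batch_attention(q, k, v, batch_size=64):
    """q, k, v: (N, d) tensors for a single head. Returns an (N, d) tensor."""
    n, d = q.shape
    perm = torch.randperm(n)            # random shuffle of token indices
    inv = torch.argsort(perm)           # inverse permutation to undo the shuffle
    pad = (-n) % batch_size             # pad so N splits evenly into batches

    def regroup(x):
        x = x[perm]
        if pad:
            x = torch.cat([x, torch.zeros(pad, d, dtype=x.dtype)], dim=0)
        return x.view(-1, batch_size, d)   # (num_batches, p, d)

    qb, kb, vb = regroup(q), regroup(k), regroup(v)
    # Attention restricted to within each random batch
    # (padded keys are left unmasked for brevity).
    scores = qb @ kb.transpose(1, 2) / d ** 0.5   # (num_batches, p, p)
    out = F.softmax(scores, dim=-1) @ vb          # (num_batches, p, d)
    out = out.reshape(-1, d)[:n]                  # drop padding
    return out[inv]                               # restore original token order

# Usage: cost grows linearly with the number of tokens for a fixed batch_size.
q, k, v = (torch.randn(1000, 32) for _ in range(3))
y = random_batch_attention(q, k, v, batch_size=50)   # y.shape == (1000, 32)
```

The extra leading dimension of `qb`, `kb`, and `vb` (one slot per random batch) is what allows the batched matrix multiplications to run in parallel, which is consistent with the abstract's claim of parallelism along a new dimension.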