State-of-the-art transformer models use pairwise dot-product-based self-attention, which comes at a computational cost quadratic in the input sequence length. In this paper, we investigate the global structure of attention scores computed using this dot-product mechanism on a typical distribution of inputs, and study the principal components of their variation. Through eigen-analysis of full attention score matrices, as well as of their individual rows, we find that most of the variation among attention scores lies in a low-dimensional eigenspace. Moreover, we find significant overlap between these eigenspaces for different layers and even different transformer models. Based on this, we propose to compute scores only for a subset of token pairs, and to use them to estimate scores for the remaining pairs. Beyond evaluating the accuracy of reconstructing the attention scores themselves, we investigate training transformer models that employ these approximations, and analyze the effect on overall accuracy. Our analysis and the proposed method provide insights into how to balance the benefits of exact pairwise attention against its significant computational expense.
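As a rough illustration of the idea described above, the following is a minimal NumPy sketch, not the paper's actual procedure: it forms exact dot-product scores from randomly generated queries and keys, performs an eigen-analysis (PCA via SVD) of the score rows, and then estimates each row from a small sampled subset of its entries by least-squares projection onto the top-k eigenvectors. The sequence length n, head dimension d_k, rank k, and number of sampled positions m are illustrative assumptions; with random inputs, the strong low-dimensional structure reported for trained models will not necessarily appear.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: random queries/keys stand in for a trained model's
# projections; n is the sequence length, d_k the head dimension.
rng = np.random.default_rng(0)
n, d_k = 128, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

# Exact pre-softmax attention scores (n x n), quadratic in n.
S = Q @ K.T / np.sqrt(d_k)

# Eigen-analysis of the rows: PCA of the centered score rows shows how much
# of their variation lies in a low-dimensional eigenspace.
S_centered = S - S.mean(axis=0, keepdims=True)
U, sing, Vt = np.linalg.svd(S_centered, full_matrices=False)
explained = (sing ** 2) / (sing ** 2).sum()
k = 8  # illustrative rank of the retained eigenspace
print(f"variance captured by top-{k} components: {explained[:k].sum():.3f}")

# Approximation: compute scores only for m sampled key positions per row,
# then estimate the remaining entries by least-squares fitting each partial
# row in the span of the top-k eigenvectors.
m = 32
cols = rng.choice(n, size=m, replace=False)
basis = Vt[:k]                                   # top-k row-space directions (k x n)
coeffs, *_ = np.linalg.lstsq(basis[:, cols].T,   # (m x k) design matrix
                             S_centered[:, cols].T,  # sampled entries per row
                             rcond=None)
S_hat = coeffs.T @ basis + S.mean(axis=0, keepdims=True)
S_hat[:, cols] = S[:, cols]                      # keep exactly computed entries

err = np.abs(softmax(S_hat) - softmax(S)).mean()
print(f"mean absolute error of approximate attention weights: {err:.4f}")
```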