Your Transformer May Not be as Powerful as You Expect

Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, the theoretical understanding of RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers, asking whether the model can approximate any continuous sequence-to-sequence function. One may naturally assume the answer is affirmative, that is, that RPE-based Transformers are universal function approximators. However, we present a negative result: there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate, no matter how deep and wide the network is. One key reason is that most RPEs are placed inside the softmax attention, which always produces a right stochastic matrix (non-negative entries, with each row summing to one). This restricts the network from capturing the positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. Guided by this theory, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. The corresponding URPE-based Transformers are therefore universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and achieves superior performance to strong baselines in a wide range of applications.
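
The role of the right stochastic matrix can be illustrated with a small sketch. It is not taken from the paper: the scalar tokens, the bias values, and the function names `rpe_attention` and `rescaled_attention` are all hypothetical, chosen only to make the point visible. When every input token is identical, each row of the softmax attention matrix is a set of weights summing to one, so every output position is the same weighted average, whatever the relative biases are. The second function shows, purely as an assumption about one possible remedy, that rescaling the attention weights by a factor depending on relative offset breaks row-stochasticity and lets position show through. The abstract does not say how URPE Attention is constructed, so this should not be read as its definition.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rpe_attention(x, bias):
    """Toy single-head attention over scalar tokens with an additive relative bias.

    x    : list of n scalar tokens (used as queries, keys and values alike)
    bias : dict mapping relative offset (j - i) to a scalar (hypothetical RPE)
    """
    n = len(x)
    out = []
    for i in range(n):
        logits = [x[i] * x[j] + bias[j - i] for j in range(n)]
        weights = softmax(logits)  # non-negative and sums to 1
        out.append(sum(w * v for w, v in zip(weights, x)))
    return out

def rescaled_attention(x, bias, scale):
    """Same as rpe_attention, but each weight is multiplied by scale[j - i].

    This rescaling is an illustrative assumption; the rows of the resulting
    matrix no longer have to sum to 1.
    """
    n = len(x)
    out = []
    for i in range(n):
        logits = [x[i] * x[j] + bias[j - i] for j in range(n)]
        weights = softmax(logits)
        out.append(sum(w * scale[j - i] * v for j, (w, v) in enumerate(zip(weights, x))))
    return out

n = 4
offsets = range(-(n - 1), n)
bias = {d: 0.5 * d for d in offsets}         # hypothetical relative biases
scale = {d: 1.0 + 0.1 * d for d in offsets}  # hypothetical rescaling factors
constant_input = [2.0] * n

plain = rpe_attention(constant_input, bias)
rescaled = rescaled_attention(constant_input, bias, scale)
print(plain)     # every position yields the same value
print(rescaled)  # values now differ across positions
```

With the constant input, `plain` carries no trace of position, while `rescaled` does. Any target function that maps a constant sequence to position-dependent outputs is therefore out of reach for the first variant in this toy setting, which matches the intuition behind the negative result.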
