Transformer-based models are inefficient at processing long sequences due to the quadratic space and time complexity of the self-attention modules. To address this limitation, Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively. These two models are intrinsically connected, and to understand their connection, we introduce a theoretical framework of matrix sketching. Based on the theoretical analysis, we propose Skeinformer, which accelerates self-attention and further improves the accuracy of the matrix approximation to self-attention with three carefully designed components: column sampling, adaptive row normalization, and pilot sampling reutilization. Experiments on the Long Range Arena (LRA) benchmark demonstrate that our methods outperform alternatives with a consistently smaller time/space footprint.
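To make the matrix-sketching view concrete, the following is a minimal illustrative sketch (assuming NumPy, a single attention head, and squared key norms as a stand-in importance distribution); it shows generic column sampling applied to the attention matrix, not the exact Skeinformer procedure, which additionally uses adaptive row normalization and pilot sampling reutilization.

```python
import numpy as np

def column_sampled_attention(Q, K, V, s, rng=None):
    """Approximate softmax(Q K^T / sqrt(d)) V by sampling s columns (keys).

    Illustrative matrix-sketching baseline: keys are drawn with probabilities
    proportional to their squared norms (a cheap proxy for column importance),
    only an n x s slice of the score matrix is materialized, and the sampled
    columns are reweighted by 1/(s * p_i) so that both the numerator and the
    softmax row sums are estimated without bias. Cost is O(n s d) rather than
    the O(n^2 d) of exact attention.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = K.shape

    # Sampling distribution over keys (proxy importance scores).
    key_norms = (K ** 2).sum(axis=1)
    p = key_norms / key_norms.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)

    # Only an n x s block of attention scores is ever formed.
    scores = Q @ K[idx].T / np.sqrt(d)
    # Row-wise max subtraction for stability; the constant cancels in the ratio.
    A_cols = np.exp(scores - scores.max(axis=1, keepdims=True))

    weights = 1.0 / (s * p[idx])                            # importance-sampling weights
    numer = (A_cols * weights) @ V[idx]                     # estimates exp-scores @ V
    denom = (A_cols * weights).sum(axis=1, keepdims=True)   # estimates softmax row sums
    return numer / denom
```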