DCT-Former: Efficient Self-Attention with Discrete Cosine Transform

Since their introduction, Transformer architectures have emerged as the dominant architectures for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, whose memory consumption and number of operations both grow as $O(n^2)$, where $n$ is the input sequence length, thus limiting applications that require modeling very long sequences. Several approaches have been proposed in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. Extensive experiments show that our method uses less memory for the same performance, while also drastically reducing inference time. This makes it particularly suitable for real-time contexts on embedded platforms. Moreover, we believe that the results of our research might serve as a starting point for a broader family of deep neural models with reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public
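As a rough illustration of the idea, the following NumPy sketch compresses the input along the sequence axis with a truncated DCT, computes ordinary dot-product attention on the $m < n$ retained coefficients, and maps the result back to length $n$ with the transposed (inverse) transform. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`dct_matrix`, `dct_compressed_attention`), the choice to compress the input before the query/key/value projections, and the parameter `m` are all illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); D @ x is the DCT of x."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)     # rescale the DC row so that D @ D.T == I
    return d

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention; the score matrix is len(q) x len(k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dct_compressed_attention(x, wq, wk, wv, m):
    """Hypothetical sketch: attention on the m lowest-frequency DCT coefficients.

    x is (n, d_model); wq, wk, wv are (d_model, d_head). The score matrix
    is m x m instead of n x n, which is where the savings come from.
    """
    n = x.shape[0]
    d = dct_matrix(n)[:m]                              # (m, n): keep low frequencies only
    x_c = d @ x                                        # (m, d_model): compressed sequence
    out_c = attention(x_c @ wq, x_c @ wk, x_c @ wv)    # (m, d_head)
    return d.T @ out_c                                 # (n, d_head): back to full length

# Example usage with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
n, d_model, d_head, m = 64, 16, 8, 16
x = rng.standard_normal((n, d_model))
wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = dct_compressed_attention(x, wq, wk, wv, m)
```

With these sizes the attention scores occupy a 16 x 16 matrix rather than 64 x 64. Discarding the high-frequency coefficients is the lossy step, analogous to what JPEG does for image blocks, so the output is an approximation of full attention rather than an exact reproduction.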
