Recent works show that hybrid architectures combining sliding-window softmax attention layers with linear recurrent neural network (RNN) layers outperform either architecture on its own. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short-window attention encourages the model to better train the long-term memory of the xLSTM, since the model relies less on the softmax attention mechanism for long-context retrieval. The drawback of small sliding windows is that they hurt short-context tasks, which a moderately larger window could otherwise solve with the information it captures. We therefore train SWAX by stochastically varying the sliding-window size, forcing the model to leverage both a longer attention window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular sliding-window attention on both short- and long-context problems.
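To make the stochastic-window idea concrete, the following is a minimal sketch, assuming a PyTorch implementation: a causal softmax attention restricted to a sliding window, plus a sampler that draws the window size at random for each training step. The candidate window sizes, the sampling distribution, and all function names are illustrative assumptions, not the paper's; the xLSTM layers and the way the two layer types are interleaved in SWAX are not shown.

```python
# Sketch (not the authors' code) of the two ideas named in the abstract:
# (1) softmax attention restricted to a causal sliding window, and
# (2) sampling the window size stochastically at each training step.
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Causal softmax attention where each query only attends to the last `window` keys."""
    T = q.size(-2)
    pos = torch.arange(T, device=q.device)
    # key j is visible from query i iff 0 <= i - j < window
    diff = pos[:, None] - pos[None, :]
    visible = (diff >= 0) & (diff < window)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def sample_window_size(max_window: int) -> int:
    """Draw a window size for the current training step.
    The uniform choice over a small set of sizes is an assumption for illustration."""
    choices = [w for w in (128, 512, 2048, max_window) if w <= max_window]
    return choices[int(torch.randint(len(choices), (1,)))]


if __name__ == "__main__":
    B, H, T, D = 2, 4, 256, 32  # batch, heads, sequence length, head dim (arbitrary)
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    w = sample_window_size(max_window=T)
    out = sliding_window_attention(q, k, v, window=w)
    print(f"sampled window={w}, output shape={tuple(out.shape)}")
```

The intended effect described in the abstract is that frequently training with small windows pushes long-range retrieval onto the xLSTM memory, while occasionally seeing larger windows preserves short-context performance.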