Modern autoregressive models rely on attention, yet Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \textit{Associative Memory} interpretation, its difference-style update renders the training objective effectively \emph{unbounded}. Softmax attention, in contrast, normalizes its updates, which leads to \emph{memory shrinkage and gradient vanishing}. We propose GatedFWA: a Memory-\underline{Gated} (\underline{F}lash) \underline{W}indowed \underline{A}ttention mechanism that preserves SWA's efficiency while stabilizing memory updates and keeping gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, which acts as a learnable contraction in the memory recurrence. We implement a fused one-pass gate-preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead, makes better use of global context, integrates cleanly with token compression/selection methods such as NSA, and generalizes to various autoregressive domains.
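As a minimal sketch of the mechanism (the notation and window convention here are ours, not fixed by the abstract): with per-token/head gates $g_r \in (0,1]$ and window size $w$, the gated windowed logits and outputs can be written as
\[
A_{t,s} \;=\; \frac{q_t^{\top} k_s}{\sqrt{d}} \;+\; \sum_{r=s+1}^{t} \log g_r,
\qquad t-w < s \le t,
\qquad
o_t \;=\; \sum_{s=t-w+1}^{t} \operatorname{softmax}_s\!\big(A_{t,s}\big)\, v_s,
\]
so the accumulated log-gates enter the logits as a decay bias and act as a learnable contraction on the windowed memory recurrence.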