The Transformer has become the de facto standard for modern language models owing to its parallelizable training and effective autoregressive decoding. However, its fixed context window and the quadratic time and memory costs of its self-attention mechanism remain central bottlenecks. These constraints have revived interest in recurrent architectures that scale linearly with sequence length, but at the cost of reduced parallelism. In this paper, we introduce Avey, a new foundational architecture that breaks away from both attention and recurrence. Avey pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens for any given token. Specifically, it decouples sequence length from context width, thus enabling effective and efficient processing of arbitrarily long sequences. Results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while significantly outperforming it on tasks requiring long-range dependency modeling.
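To make the ranker/processor pairing concrete, below is a minimal sketch of the general idea, not the paper's actual method: a ranker scores previously seen token embeddings and keeps only the top-k most relevant ones, so the processor's context width stays fixed no matter how long the sequence grows. All names, shapes, the cosine-similarity ranker, and the weighted-average "processor" are illustrative assumptions rather than details taken from Avey.

```python
# Illustrative sketch of ranking-then-contextualizing (hypothetical, not Avey's implementation).
import torch
import torch.nn.functional as F

def rank_and_contextualize(history: torch.Tensor,
                           current: torch.Tensor,
                           k: int = 8) -> torch.Tensor:
    """history: (T, d) embeddings of earlier tokens; current: (d,) embedding.
    Returns a contextualized (d,) embedding built from only the top-k history rows."""
    # Ranker: score every earlier token by cosine similarity to the current token.
    scores = F.cosine_similarity(history, current.unsqueeze(0), dim=-1)  # (T,)
    k = min(k, history.shape[0])
    top_idx = scores.topk(k).indices                                     # (k,)
    selected = history[top_idx]                                          # (k, d)

    # Processor (stand-in): a similarity-weighted average of the selected tokens,
    # mixed back into the current embedding. A real processor would be learned.
    weights = scores[top_idx].softmax(dim=0).unsqueeze(-1)               # (k, 1)
    context = (weights * selected).sum(dim=0)                            # (d,)
    return current + context

# Usage: the context width (k) is constant even as the sequence length T grows,
# which is the sense in which sequence length is decoupled from context width.
d = 64
history = torch.randn(1000, d)   # 1000 earlier tokens
current = torch.randn(d)
out = rank_and_contextualize(history, current, k=8)
print(out.shape)  # torch.Size([64])
```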