Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
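As a rough illustration of the construction described above, the sketch below shows an order-2 Hyena-style operator in PyTorch: input projections produce a value stream plus gating signals, and the output is formed by alternating FFT-based long convolutions with elementwise (data-controlled) gating. This is a minimal sketch, not the paper's implementation: the names `HyenaSketch` and `fft_long_conv` are illustrative, an explicit learned filter tensor stands in for the implicit (FFN-based) filter parametrization, and details such as the short convolutions in the projections are omitted.

```python
# Minimal sketch of an order-2 Hyena-style operator (assumed PyTorch environment).
# Explicit filters replace the paper's implicit parametrization; projection details
# (short convolutions, multi-head layout) are omitted for brevity.
import torch
import torch.nn as nn


def fft_long_conv(z: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Causal long convolution of z (B, L, D) with filter h (L, D) via FFT."""
    L = z.shape[1]
    # Zero-pad to length 2L so the circular FFT convolution becomes a linear one.
    z_f = torch.fft.rfft(z, n=2 * L, dim=1)
    h_f = torch.fft.rfft(h, n=2 * L, dim=0)
    y = torch.fft.irfft(z_f * h_f.unsqueeze(0), n=2 * L, dim=1)[:, :L]
    return y


class HyenaSketch(nn.Module):
    """Interleaved long convolutions and data-controlled gating (order-2 sketch)."""

    def __init__(self, d_model: int, max_len: int, order: int = 2):
        super().__init__()
        self.order = order
        # One projection producing the value stream v and the gating signals x_1..x_N.
        self.in_proj = nn.Linear(d_model, d_model * (order + 1))
        self.out_proj = nn.Linear(d_model, d_model)
        # Explicit per-order filters; the paper instead parametrizes these implicitly.
        self.filters = nn.Parameter(torch.randn(order, max_len, d_model) * 0.02)

    def forward(self, u: torch.Tensor) -> torch.Tensor:  # u: (B, L, D)
        L = u.shape[1]
        projections = self.in_proj(u).chunk(self.order + 1, dim=-1)
        v, gates = projections[0], projections[1:]
        z = v
        for n in range(self.order):
            # Long convolution followed by elementwise (data-controlled) gating.
            z = gates[n] * fft_long_conv(z, self.filters[n, :L])
        return self.out_proj(z)
```

Because the convolutions are evaluated with FFTs, the whole operator runs in O(L log L) time in the sequence length L, which is the source of the subquadratic scaling claimed above.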