Transformer models have achieved remarkable success on a wide range of NLP tasks. However, these models suffer from efficiency issues on long sequences, as the complexity of their self-attention module scales quadratically with the sequence length. To remedy this limitation, we present Memformer, a novel language model that utilizes a single unified memory to encode and retrieve past information. It includes a new optimization scheme, Memory Replay Back-Propagation, which promotes long-range back-propagation through time with a significantly reduced memory requirement. Memformer achieves $\mathcal{O}(n)$ time complexity and $\mathcal{O}(1)$ space complexity in processing long sequences, meaning that the model can handle sequences of unbounded length during inference. Our model is also compatible with other self-supervised tasks to further improve performance on language modeling. Experimental results show that Memformer outperforms previous long-range sequence models on WikiText-103, including Transformer-XL and the Compressive Transformer.
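To illustrate how a fixed-size external memory yields $\mathcal{O}(n)$ time and $\mathcal{O}(1)$ space, the sketch below shows a generic segment-level recurrence in PyTorch: each segment reads from a constant-size memory via cross-attention and then writes an updated memory back. This is a minimal illustration of the general idea, not the paper's actual architecture or the Memory Replay Back-Propagation scheme; all names (`MemorySegmentModel`, `memory_slots`, segment length 64) are hypothetical, and the `detach()` call is plain truncated BPTT used here only to keep the example short.

```python
import torch
import torch.nn as nn


class MemorySegmentModel(nn.Module):
    """Illustrative fixed-size memory recurrence (not the paper's exact design).

    Each segment attends to a constant-size memory (memory reading), and the
    memory is then updated from the segment's hidden states (memory writing),
    so space stays O(1) in the number of processed segments.
    """

    def __init__(self, d_model=256, n_heads=4, memory_slots=32):
        super().__init__()
        self.init_memory = nn.Parameter(torch.zeros(memory_slots, d_model))
        self.read_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.write_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, segment, memory=None):
        # segment: (batch, seg_len, d_model)
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(segment.size(0), -1, -1)
        # Memory reading: segment tokens attend to the memory slots.
        read, _ = self.read_attn(segment, memory, memory)
        hidden = self.encoder(segment + read)
        # Memory writing: memory slots attend to the segment's hidden states.
        new_memory, _ = self.write_attn(memory, hidden, hidden)
        return hidden, new_memory


# Usage: process an arbitrarily long sequence segment by segment.
model = MemorySegmentModel()
memory = None
long_sequence = torch.randn(2, 4 * 64, 256)       # (batch, total_len, d_model)
for segment in long_sequence.split(64, dim=1):    # fixed-length segments
    hidden, memory = model(segment, memory)
    memory = memory.detach()                      # simple truncation between segments
```

Because the memory has a fixed number of slots, the per-segment cost is constant, and processing a sequence of $n$ tokens takes time linear in $n$; the paper's Memory Replay Back-Propagation additionally enables gradients to flow across many segments without storing all intermediate activations.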