Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the Transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction for improving the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks from question answering to language modeling, outperforming RNNs and LSTMs of comparable complexity. In this work, we propose and study a few extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for the global information, and (3) controlling the memory update with a dedicated layer. We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance on machine translation and language modeling tasks. Augmenting a pre-trained masked language model with memory tokens shows mixed results on tasks from the GLUE benchmark. Visualization of attention patterns over the memory suggests that it improves the model's ability to process a global context.
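
As a rough illustration of extensions (1) and (2), the sketch below prepends memory slots to a token sequence and builds a boolean attention mask over the combined sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the `[mem]` symbol, the function names, and the specific bottleneck rule (sequence positions read only from memory, while memory reads from everything) are hypothetical choices made for illustration. The abstract itself does not spell out these details.

```python
from typing import List

MEM_TOKEN = "[mem]"  # hypothetical placeholder symbol for one memory slot


def prepend_memory(tokens: List[str], num_mem: int) -> List[str]:
    """Return a new sequence with num_mem memory tokens placed in front."""
    return [MEM_TOKEN] * num_mem + list(tokens)


def attention_mask(num_mem: int, seq_len: int,
                   bottleneck: bool = False) -> List[List[bool]]:
    """Build mask[i][j], True when position i may attend to position j.

    Positions 0..num_mem-1 are memory slots, the rest are sequence tokens.
    Without a bottleneck, every position attends to every position.
    With a bottleneck (an assumed reading of extension 2), sequence
    positions attend only to memory, so any information shared between
    sequence elements has to pass through the memory slots.
    """
    total = num_mem + seq_len
    mask = []
    for i in range(total):
        query_is_mem = i < num_mem
        row = []
        for j in range(total):
            key_is_mem = j < num_mem
            if bottleneck:
                # Memory queries see everything; sequence queries see memory only.
                row.append(query_is_mem or key_is_mem)
            else:
                row.append(True)
        mask.append(row)
    return mask


if __name__ == "__main__":
    tokens = prepend_memory(["the", "cat", "sat"], num_mem=2)
    print(tokens)
    for row in attention_mask(2, 3, bottleneck=True):
        print(["x" if allowed else "." for allowed in row])
```

In a real model the memory slots would be trainable embeddings and the mask would be passed to the self-attention layers. The mask here only shows which positions are allowed to exchange information directly under each assumed variant.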
