像变形人一样思考 (Thinking Like Transformers)

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.

翻译：变换器背后的计算模型是什么? 当反复出现的神经网络在有限的状态机器中具有直接相似之处,允许围绕建筑变异或经过训练的模型进行清晰的讨论和思考时,变换器没有如此熟悉的相似之处。在本文中,我们的目标是改变这一模式,以编程语言的形式为变压器-变压器编码器提出一个计算模型。我们将变压器-变压器的基本组成部分 -- -- 注意和进向计算 -- -- 映射成简单的原始体,我们形成一种编程语言:限制存取序列处理语言(RASP)。我们展示了RASP如何用程序来编程解决方案解决由变换器可以想象到的任务,以及如何训练变换器模拟变压器的解决方案。特别是,我们为变压器-变压器-变压器、分排序和Dyck-语言提供了RASP程序。我们进一步使用我们的模型将变压器的难度与所需层次和注意头数联系起来:分析RASP程序意味着在变换器中将任务编码为最多必要的头和层次。我们可以看到如何利用抽象现象。