Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computational efficiency with a simple early-exit rule, saving 20% of computation on average.
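The decomposition described above can be sketched concretely: an FFN layer's output is a weighted sum of the rows of its second parameter matrix (the "value" vectors), so it splits exactly into per-vector sub-updates, each of which can be projected through the output embedding matrix to see which tokens it promotes. The sketch below uses toy random matrices and a ReLU activation purely for illustration; the dimensions, matrix names (`K`, `V`, `E`), and activation are assumptions, not the paper's exact setup.

```python
import numpy as np

# Toy dimensions (hypothetical; e.g. GPT2 uses d_model=768, d_ffn=3072, |V|=50257).
rng = np.random.default_rng(0)
d_model, d_ffn, vocab = 8, 16, 20

K = rng.normal(size=(d_model, d_ffn))  # first FFN matrix ("keys")
V = rng.normal(size=(d_ffn, d_model))  # second FFN matrix ("values")
E = rng.normal(size=(vocab, d_model))  # output embedding matrix

x = rng.normal(size=(d_model,))        # a token's hidden representation
coeffs = np.maximum(x @ K, 0.0)        # activation coefficients m_i (ReLU here)

# The full FFN output equals the sum of sub-updates m_i * v_i.
ffn_out = coeffs @ V
sub_updates = coeffs[:, None] * V      # one sub-update per value vector

# Inspect each sub-update in vocabulary space: the tokens it promotes most.
for i in np.argsort(-coeffs)[:3]:      # top-3 most active value vectors
    logits = E @ V[i]
    top_tokens = np.argsort(-logits)[:5]
    print(f"value vector {i}: promotes token ids {top_tokens.tolist()}")
```

With a real model, `V` would be the layer's second FFN weight matrix and `E` the tied output embedding, and the top promoted tokens would be decoded to readable strings before inspection.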