A bottleneck in transformer architectures is their quadratic complexity with respect to the input sequence length, which has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency by performing an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for a detailed comparison between different models, and may guide future benchmarks for sparse attention models.
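The following is a minimal sketch, not the authors' code, illustrating what "exact sparse attention" and the "attention graph" refer to: 1.5-entmax assigns exact zeros to some query-key pairs, and the resulting nonzero pattern is the graph that Sparsefinder-style predictors aim to recover before the full quadratic computation. It assumes PyTorch and the `entmax` package; names like `queries` and `keys` are illustrative.

```python
import torch
from entmax import entmax15  # pip install entmax

torch.manual_seed(0)
seq_len, d = 6, 16
queries = torch.randn(seq_len, d)
keys = torch.randn(seq_len, d)

# Scaled dot-product scores, as in standard attention.
scores = queries @ keys.T / d ** 0.5

# Softmax yields dense weights; 1.5-entmax yields exact zeros.
dense = torch.softmax(scores, dim=-1)
sparse = entmax15(scores, dim=-1)

# The ground-truth attention graph: (query, key) pairs with nonzero weight.
graph = sparse > 0
print("nonzero keys per query:", graph.sum(dim=-1).tolist())
print("fraction of the full n^2 graph kept:", graph.float().mean().item())
```

A predictor that proposes a candidate graph can then be scored by its sparsity (how few pairs it keeps) and its recall (how many of the true nonzero pairs it covers), which is the tradeoff analyzed in the paper.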