Attention mechanisms form the backbone of state-of-the-art machine learning models for a variety of tasks. Deploying them on deep neural network (DNN) accelerators, however, is prohibitively challenging, especially under long sequences. Operators in attention layers exhibit limited reuse and quadratic growth in memory footprint, leading to severe memory-boundedness. This paper introduces a new attention-tailored dataflow, termed FLAT, which leverages operator fusion, loop-nest optimizations, and interleaved execution. It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better runtime and compute resource utilization. We term FLAT-compatible accelerators ATTACC. In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction compared to state-of-the-art edge and cloud accelerators, respectively.
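The sketch below is a minimal, illustrative rendering (not the authors' implementation) of the idea the abstract describes: fusing the logit, softmax, and attention-value operators and tiling along the sequence dimension, so that only a small slice of the N x N attention matrix is live at any time, standing in for the small on-chip buffer the dataflow targets. The function and parameter names (fused_tiled_attention, tile_rows) are assumptions for illustration.

```python
import numpy as np

def fused_tiled_attention(Q, K, V, tile_rows=64):
    """Compute softmax(Q K^T / sqrt(d)) V one row tile at a time.

    Illustrative only: tile_rows plays the role of the on-chip tile size;
    the full (N x N) logit matrix is never materialized at once.
    """
    N, d = Q.shape
    out = np.empty((N, d), dtype=Q.dtype)
    scale = 1.0 / np.sqrt(d)
    for start in range(0, N, tile_rows):
        end = min(start + tile_rows, N)
        # Only a (tile_rows x N) slice of logits is live here.
        logits = (Q[start:end] @ K.T) * scale
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:end] = probs @ V
    return out

# Reference check against the unfused computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
logits = (Q @ K.T) / np.sqrt(64)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
assert np.allclose(fused_tiled_attention(Q, K, V), probs @ V)
```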