BERT is a recent Transformer-based model that achieves state-of-the-art performance on various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the huge computational complexity and memory footprint, we propose a fully quantized BERT (FQ-BERT), in which the weights, activations, softmax, layer normalization, and all intermediate results are quantized. Experiments demonstrate that FQ-BERT achieves 7.94x compression of the weights with negligible performance loss. We then propose an accelerator tailored for FQ-BERT and evaluate it on Xilinx ZCU102 and ZCU111 FPGAs. It achieves a performance-per-watt of 3.18 fps/W, which is 28.91x and 12.72x higher than an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively.
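To make the quantization concrete, below is a minimal sketch of symmetric linear quantization, a common scheme for fully quantized models. The abstract does not specify the exact quantizer or bit-widths, so the 4-bit weight setting here is an assumption chosen to be consistent with the reported 7.94x weight compression (32/4 = 8x before scale-factor overhead); the function names are illustrative, not from the paper.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric linear quantization of a float tensor to signed integers.

    NOTE: illustrative sketch; the paper's actual quantizer may differ.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax    # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map quantized integers back to approximate float values."""
    return q.astype(np.float32) * scale

# Example: quantize a BERT-base-sized weight matrix to 4 bits.
w = np.random.randn(768, 768).astype(np.float32)
qw, s = quantize_symmetric(w, bits=4)
w_hat = dequantize(qw, s)
print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
```

On hardware, keeping every tensor (including softmax and layer-normalization inputs/outputs) in such integer formats is what allows the accelerator's datapath to avoid floating-point units entirely.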