In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks. However, the huge size of these models brings significant challenges to their fine-tuning and online deployment due to latency and cost constraints. New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost DNN model serving efficiency. However, there have been very few studies that systematically investigate to what extent pre-trained Transformer networks benefit from the combination of these techniques, as well as how to best compress each component of the Transformer. We propose a flexible compression framework, NxMiFormer, that performs simultaneous sparsification and quantization using ADMM and STE-based QAT. Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations that meet a compression ratio constraint. When evaluated across the GLUE suite of NLU benchmarks, our approach can achieve up to 93% compression of the encoders of a BERT model while retaining 98.2% of the original model accuracy and taking full advantage of the hardware's capabilities. Heterogeneous configurations found by the search heuristic maintain 99.5% of the baseline accuracy while still compressing the model by 87.5%.
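To make the compression target concrete, the sketch below illustrates the two hardware-supported transformations the abstract refers to: enforcing a 2:4 N:M sparsity pattern on a weight matrix and representing the surviving weights as int8 integers. This is a minimal, self-contained illustration, not the NxMiFormer implementation; the function names and the simple magnitude-based pruning and per-tensor symmetric quantization scheme are assumptions for exposition only.

```python
# Minimal sketch (assumed, not the paper's code): 2:4 N:M sparsity plus int8 quantization.
import torch


def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every contiguous group of m along the last dim."""
    rows, cols = weight.shape
    groups = weight.reshape(rows, cols // m, m)
    _, idx = groups.abs().topk(n, dim=-1)          # indices of the n kept weights per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).reshape(rows, cols)


def int8_quantize(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization; returns (quantized weights, scale)."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale


w = torch.randn(8, 16)             # toy weight matrix (columns divisible by m)
w_sparse = nm_prune(w)             # 2:4 semi-structured sparsity pattern
w_q, s = int8_quantize(w_sparse)   # low-precision integer representation
w_dequant = w_q.float() * s        # dequantized values the forward pass would use
```

In the framework described above, these constraints are not applied post hoc as in this sketch but are learned during fine-tuning via ADMM (for the sparsity pattern) and STE-based quantization-aware training (for the integer representation).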