Pre-trained language models such as BERT and RoBERTa, while powerful across many natural language processing tasks, are expensive in both computation and memory. One way to alleviate this is to compress the model for a specific task before deployment. However, recent work on BERT compression usually compresses the large BERT model to a single fixed smaller size, which cannot fully meet the requirements of edge devices with differing hardware capabilities. In this paper, we propose a novel dynamic BERT model (abbreviated DynaBERT), which can run at adaptive width and depth. Training DynaBERT involves first training a width-adaptive BERT, and then allowing both adaptive width and depth by distilling knowledge from the full-sized model into the smaller sub-networks. Network rewiring is also used so that the more important attention heads and neurons are shared by more sub-networks. Comprehensive experiments under various efficiency constraints show that the proposed dynamic BERT (or RoBERTa) at its largest size matches the performance of BERT (or RoBERTa), while at smaller widths and depths it consistently outperforms existing BERT compression methods.
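To make the two-stage training procedure concrete, below is a minimal PyTorch sketch of a distillation loop of this kind. It is illustrative only: the `set_active_subnet` method, the HuggingFace-style `.logits` output, and the width/depth multiplier lists are assumptions rather than the authors' implementation, and the network-rewiring step is omitted.

```python
import torch
import torch.nn.functional as F

# Candidate sub-network sizes (illustrative values, not the paper's configuration)
WIDTH_MULTS = [1.0, 0.75, 0.5, 0.25]
DEPTH_MULTS = [1.0, 0.75, 0.5]

def distill_step(teacher, student, batch, width_mults, depth_mults, optimizer):
    """One optimization step that distills the full-sized model into sub-networks."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # soft targets from the full model

    optimizer.zero_grad()
    for w in width_mults:
        for d in depth_mults:
            # `set_active_subnet` is a hypothetical method that restricts the
            # transformer to the chosen fraction of attention heads / neurons
            # (width) and layers (depth).
            student.set_active_subnet(width=w, depth=d)
            student_logits = student(**batch).logits
            loss = F.kl_div(
                F.log_softmax(student_logits, dim=-1),
                F.softmax(teacher_logits, dim=-1),
                reduction="batchmean",
            )
            loss.backward()  # accumulate gradients over all active sub-networks
    optimizer.step()

# Stage 1: adapt width only           -> distill_step(..., WIDTH_MULTS, [1.0], ...)
# Stage 2: adapt both width and depth -> distill_step(..., WIDTH_MULTS, DEPTH_MULTS, ...)
```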

Latest Papers

A computationally expensive and memory intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such a vast language model in resource-scarce environments, transfers the knowledge on individual word representations learned without restrictions. In this paper, inspired by the recent observations that language representations are relatively positioned and have more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers the contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for the language models, our contextual distillation does not have any restrictions on architectural changes between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only in architectures of various sizes, but also in combination with DynaBERT, the recently proposed adaptive size pruning method.
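As a rough illustration of the Word Relation part of this objective, the sketch below compares token-to-token similarity matrices of the teacher's and student's hidden states; because these relation matrices do not depend on the hidden dimensionality, the student architecture is left unconstrained. The function name and the choice of cosine similarity with an MSE penalty are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def word_relation_loss(teacher_hidden: torch.Tensor, student_hidden: torch.Tensor) -> torch.Tensor:
    """Match pairwise word relations instead of individual word vectors.

    teacher_hidden: [batch, seq_len, d_teacher]
    student_hidden: [batch, seq_len, d_student]
    The similarity matrices are [batch, seq_len, seq_len], so the teacher's
    and student's hidden sizes may differ.
    """
    def pairwise_cosine(h: torch.Tensor) -> torch.Tensor:
        h = F.normalize(h, dim=-1)              # unit-length token vectors
        return torch.bmm(h, h.transpose(1, 2))  # token-to-token similarities

    return F.mse_loss(pairwise_cosine(student_hidden),
                      pairwise_cosine(teacher_hidden))

# Example with random tensors and mismatched hidden sizes:
t = torch.randn(2, 16, 768)   # teacher hidden states
s = torch.randn(2, 16, 312)   # smaller student hidden states
loss = word_relation_loss(t, s)
```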
