We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by using only self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-products between pairs of query, key, and value vectors within each self-attention module. We then employ this relational knowledge to train the student model. Beyond its simplicity and unified principle, the approach places no restriction on the number of the student's attention heads, whereas most previous work requires the same number of heads for teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by the Transformer. In addition, we thoroughly examine the layer selection strategy for the teacher model, rather than relying solely on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa, and XLM-R) outperform the state of the art.
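To make the stated definition concrete, the following is a minimal PyTorch sketch of self-attention relation distillation as described above: relations are computed as scaled dot-products between pairs of query, key, and value vectors, re-split into a common number of relation heads, and the student is trained to match the teacher's relation distributions via KL divergence. The function names, the choice of query-query, key-key, and value-value pairs, and the relation-head count are our illustrative assumptions, not code from the paper.

import torch
import torch.nn.functional as F


def self_attention_relations(vectors: torch.Tensor, num_relation_heads: int) -> torch.Tensor:
    """Scaled dot-product relation logits between pairs of vectors (queries, keys, or values).

    vectors: [batch, seq_len, hidden]. The hidden dimension is re-split into
    `num_relation_heads` relation heads, so teacher and student need not share
    the same number of attention heads.
    """
    bsz, seq_len, hidden = vectors.shape
    d_r = hidden // num_relation_heads
    # Reshape to [batch, relation_heads, seq_len, d_r].
    heads = vectors.view(bsz, seq_len, num_relation_heads, d_r).transpose(1, 2)
    # Pairwise scaled dot-products over positions: [batch, relation_heads, seq_len, seq_len].
    return torch.matmul(heads, heads.transpose(-1, -2)) / (d_r ** 0.5)


def relation_distillation_loss(teacher_q, teacher_k, teacher_v,
                               student_q, student_k, student_v,
                               num_relation_heads: int) -> torch.Tensor:
    """KL divergence between teacher and student Q-Q, K-K, and V-V relation distributions
    (an assumed choice of relation pairs for illustration)."""
    loss = vectors_loss = torch.zeros((), dtype=teacher_q.dtype, device=teacher_q.device)
    pairs = ((teacher_q, student_q), (teacher_k, student_k), (teacher_v, student_v))
    for t_vec, s_vec in pairs:
        t_rel = F.softmax(self_attention_relations(t_vec, num_relation_heads), dim=-1)
        s_rel = F.log_softmax(self_attention_relations(s_vec, num_relation_heads), dim=-1)
        loss = loss + F.kl_div(s_rel, t_rel, reduction="batchmean")
    return loss

In practice the teacher vectors come from a selected teacher layer and the student vectors from its last layer; because both sides are projected onto the same number of relation heads, the student's head count and hidden size can differ freely from the teacher's.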