Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results on question answering tasks. However, due to the sheer number of model parameters, inference with these models is very slow. How to apply these complex models to real business scenarios thus becomes a challenging but practical problem. Previous model compression methods usually suffer from information loss during the compression procedure, yielding inferior models compared with the original one. To tackle this challenge, we propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short) method for web Question Answering systems. We first develop a general Q\&A distillation task for student model pre-training, and then fine-tune this pre-trained student model with multi-teacher knowledge distillation on downstream tasks (such as the Web Q\&A task and the MNLI, SNLI, and RTE tasks from GLUE), which effectively reduces the overfitting bias of individual teacher models and transfers more general knowledge to the student model. The experimental results show that our method significantly outperforms the baseline methods and even achieves results comparable to the original teacher models, along with a substantial speedup of model inference.
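To make the core idea concrete, the following is a minimal sketch of a multi-teacher distillation loss in PyTorch. It is not the paper's exact formulation: the function name `multi_teacher_distillation_loss`, the `temperature` and `alpha` hyperparameters, and the simple averaging of teacher distributions are all illustrative assumptions; the method described above additionally involves a Q\&A distillation pre-training stage that is not shown here.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    gold_labels, temperature=2.0, alpha=0.5):
    """Blend soft-label distillation from several teachers with the
    hard-label task loss. Averaging the teachers' softened distributions
    is one simple way to combine them (an assumption for this sketch)."""
    # Soften each teacher's output distribution, then average across teachers.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # KL divergence between the student's softened distribution and the
    # averaged teacher distribution, scaled by T^2 as in standard KD.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the gold labels.
    ce_loss = F.cross_entropy(student_logits, gold_labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In this sketch, learning from the averaged soft labels of several teachers is what lets the student smooth out the idiosyncratic (overfitting) biases of any single teacher, which is the intuition behind the multi-teacher stage described above.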