Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We identify difficulties in optimization as a key reason for why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher -- and that more closely matching the teacher paradoxically does not always lead to better student generalization.
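To make the setup concrete, the following is a minimal sketch of a standard distillation objective of the kind studied here: a weighted combination of the KL divergence between temperature-softened teacher and student distributions and the usual cross-entropy on ground-truth labels. The temperature `T`, weight `alpha`, and function name are illustrative assumptions, not values or APIs taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Sketch of a standard knowledge-distillation loss.

    T and alpha are hypothetical defaults chosen for illustration only.
    """
    # Soften both predictive distributions with temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL(teacher || student); the T**2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al.-style scaling).
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Ordinary supervised cross-entropy on the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

Under this objective, "matching the teacher" means driving the KL term toward zero; the paper's observation is that in practice this term often remains large even when the student could in principle represent the teacher's predictive distribution.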