Curriculum Temperature for Knowledge Distillation (CTKD)

Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
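The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation (which lives in the linked repository). It assumes the conventional temperature-softened KL distillation loss scaled by T², a hypothetical linear easy-to-hard schedule `curriculum_weight`, and a finite-difference gradient in place of automatic differentiation. The function names, the clamping bounds `t_min`/`t_max`, and the learning rate are all illustrative. The adversarial part is the sign of the update: the student would descend on the loss, while the temperature ascends on it.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; a larger temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed convention)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def curriculum_weight(epoch, ramp_epochs):
    """Hypothetical easy-to-hard schedule: ramps linearly from 0 to 1, then stays at 1."""
    return min(1.0, epoch / ramp_epochs)


def update_temperature(student_logits, teacher_logits, temperature, lam,
                       lr=0.1, eps=1e-4, t_min=1.0, t_max=20.0):
    """One adversarial step: move the temperature in the direction that INCREASES the loss.

    `lam` is the curriculum weight; lam = 0 leaves the temperature unchanged,
    and larger values let the temperature make the task harder, faster.
    """
    # Central finite difference stands in for autograd in this sketch.
    grad = (kd_loss(student_logits, teacher_logits, temperature + eps)
            - kd_loss(student_logits, teacher_logits, temperature - eps)) / (2 * eps)
    new_temperature = temperature + lr * lam * grad  # gradient ascent, not descent
    return max(t_min, min(t_max, new_temperature))


if __name__ == "__main__":
    teacher = [5.0, 2.0, 0.5]   # illustrative logits, not real model outputs
    student = [2.0, 1.5, 1.0]
    temperature = 2.0
    for epoch in range(5):
        lam = curriculum_weight(epoch, ramp_epochs=4)
        temperature = update_temperature(student, teacher, temperature, lam)
        loss = kd_loss(student, teacher, temperature)
        print(f"epoch={epoch} lam={lam:.2f} T={temperature:.4f} loss={loss:.4f}")
```

In a real training loop the student's weights would be updated by gradient descent on the same loss in every step. Only the temperature update, with its reversed sign and curriculum weight, is shown here.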
