(1) 仅在最后的输出部分使用相似性损失与判别损失;

(2) 采用所有老师模型的平均概率作为更强的监督信息进行蒸馏。

该文提到一个非常重要的发现:在蒸馏阶段不应当使用one-hot方式的标签编码。这样一种简单的方案可以取得SOTA性能,且并未用到以下几种常见涨点tricks:(1)类似ResNet50-D的架构改进;(2)额外训练数据;(3) AutoAug、RandAug等;(4)cosine学习率机制;(5)mixup/cutmix数据增广策略;(6) 标签平滑。

在ImageNet数据集上,本文所提方法取得了80.67%的Top1精度(single crop@224),以极大的优势超越其他同架构方案。该方法可以视作采用知识蒸馏对ResNet50涨点的一个新的基准,该文可谓首个在不改变网路架构、无需额外训练数据的前提下将ResNet提升到超过80%Top1精度的方法。



Recent studies revealed the mathematical connection of deep neural network (DNN) and dynamic system. However, the fundamental principle of DNN has not been fully characterized with dynamic system in terms of optimization and generalization. To this end, we build the connection of DNN and continuity equation where the measure is conserved to model the forward propagation process of DNN which has not been addressed before. DNN learns the transformation of the input distribution to the output one. However, in the measure space, there are infinite curves connecting two distributions. Which one can lead to good optimization and generaliztion for DNN? By diving the optimal transport theory, we find DNN with weight decay attempts to learn the geodesic curve in the Wasserstein space, which is induced by the optimal transport map. Compared with plain network, ResNet is a better approximation to the geodesic curve, which explains why ResNet can be optimized and generalize better. Numerical experiments show that the data tracks of both plain network and ResNet tend to be line-shape in term of line-shape score (LSS), and the map learned by ResNet is closer to the optimal transport map in term of optimal transport score (OTS). In a word, we conclude a mathematical principle of deep learning is to learn the geodesic curve in the Wasserstein space; and deep learning is a great engineering realization of continuous transformation in high-dimensional space.