Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
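The layerwise comparison of internal representations described above is typically done with a representation-similarity measure; centered kernel alignment (CKA) is the standard choice for this kind of ViT-vs-CNN analysis, though the abstract does not name the measure. Below is a minimal sketch of linear CKA under that assumption; the variable names, shapes, and random placeholder activations are illustrative only.

```python
# Minimal sketch (assumption: linear CKA as the similarity measure):
# compare two layers' activation matrices over the same batch of inputs.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n, d1) and Y (n, d2),
    where n is the number of examples and d1, d2 are feature dims."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Similarity of the centered Gram structures, normalized to [0, 1].
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

# Hypothetical usage: compare a ViT layer with a CNN layer on the same
# batch of images (random placeholders stand in for real activations).
rng = np.random.default_rng(0)
vit_layer = rng.normal(size=(256, 768))   # e.g. pooled token features
cnn_layer = rng.normal(size=(256, 512))   # e.g. pooled conv features
print(f"CKA similarity: {linear_cka(vit_layer, cnn_layer):.3f}")
```

Computing this score for every pair of layers yields the cross-layer similarity structure from which observations such as "ViT has more uniform representations across all layers" can be read off.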