SiT: Self-supervised vIsion Transformer

Self-supervised learning methods are gaining traction in computer vision because of their recent success in narrowing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice, and recent literature suggests that transformers are becoming increasingly popular in computer vision as well. So far, vision transformers have been shown to work well when pretrained either on large-scale supervised data or with some form of co-supervision, e.g. from a teacher network. These supervised pretrained vision transformers achieve very good results on downstream tasks with minimal changes. In this work, we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and to work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets consisting of a few thousand images rather than several million. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also observe that SiT performs well in few-shot learning, and we show that it learns useful representations by simply training a linear classifier on top of the features learned by SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT.
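
The linear evaluation mentioned in the abstract (training only a linear classifier on top of frozen features) can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the authors' code: `frozen_features` is a hypothetical stand-in for a frozen pretrained encoder, and the toy data, learning rate, and epoch count are assumptions chosen purely for illustration.

```python
import math
import random


def frozen_features(x):
    """Hypothetical stand-in for a frozen, pretrained encoder (NOT SiT itself).

    It is a fixed map with no trainable parameters: 2-D input -> 3-D features.
    """
    return [x[0], x[1], math.tanh(x[0] + x[1])]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(W, b, f):
    return [sum(w * v for w, v in zip(W[c], f)) + b[c] for c in range(len(b))]


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit only a linear layer (W, b) on fixed features.

    Uses full-batch gradient descent on the softmax cross-entropy loss.
    The encoder is never updated; only W and b change.
    """
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for f, y in zip(features, labels):
            probs = softmax(linear_logits(W, b, f))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * f[j]
        for c in range(num_classes):
            b[c] -= lr * grad_b[c] / n
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j] / n
    return W, b


def predict(W, b, f):
    logits = linear_logits(W, b, f)
    return max(range(len(logits)), key=lambda c: logits[c])


def make_toy_data(n_per_class=50, seed=0):
    """Two well-separated Gaussian blobs (illustrative data, not a real benchmark)."""
    rng = random.Random(seed)
    centers = [(-2.0, -2.0), (2.0, 2.0)]
    xs, ys = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            xs.append((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)))
            ys.append(label)
    return xs, ys


xs, ys = make_toy_data()
feats = [frozen_features(x) for x in xs]  # features are computed once and kept fixed
W, b = train_linear_probe(feats, ys, num_classes=2)
accuracy = sum(predict(W, b, f) == y for f, y in zip(feats, ys)) / len(ys)
print(f"linear-probe training accuracy: {accuracy:.2f}")
```

In a real setting, the stand-in encoder would be replaced by a pretrained network with its weights frozen, and accuracy would be reported on a held-out test split rather than on the training data.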
