Self-supervised learning based on instance discrimination has shown remarkable progress. In particular, contrastive learning, which treats each image together with its augmentations as a separate class and pushes all other images away, has proved effective for pretraining. However, contrasting two images that are in fact semantically similar is hard to optimize and harms the generality of the learned representations. In this paper, we tackle this representation inefficiency of contrastive learning and propose a hierarchical training strategy that explicitly models invariance to semantically similar images in a bottom-up way. This is achieved by extending the contrastive loss to allow multiple positives per anchor, and by explicitly pulling semantically similar images/patches together at the earlier layers as well as in the final embedding space. In this way, we learn feature representations that are more discriminative throughout the network, which we find is beneficial for fast convergence. The hierarchical semantic aggregation strategy produces more discriminative representations on several unsupervised benchmarks. Notably, on ImageNet with a ResNet-50 backbone, we reach $76.4\%$ top-1 accuracy under linear evaluation, and $75.1\%$ top-1 accuracy with only $10\%$ of the labels.
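The extension of the contrastive loss to multiple positives per anchor can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes L2-normalized embeddings and pseudo-group labels (any sample sharing the anchor's label counts as a positive, and each group is assumed to contain at least two samples), with a hypothetical function name `multi_positive_nce`:

```python
import numpy as np

def multi_positive_nce(z, labels, tau=0.1):
    """Contrastive loss with multiple positives per anchor (illustrative sketch).

    z:      (N, D) array of embeddings (L2-normalized inside).
    labels: (N,) pseudo-group ids; same-label samples are positives.
    tau:    temperature scaling the cosine similarities.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                        # pairwise scaled similarities
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # exclude the anchor itself from the softmax denominator
    logits = np.where(self_mask, -np.inf, sim)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives: all other samples with the same label as the anchor
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # average the log-probability over each anchor's positives, then over anchors
    loss = -(np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)).mean()
    return loss
```

With a single positive per anchor this reduces to the standard InfoNCE loss; grouping semantically similar samples under one label is what lets the loss pull several positives toward the anchor at once.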