Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success relies heavily on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-training on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior for discovering object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves downstream performance when more unlabeled scene images are available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data. Project page: https://www.mmlab-ntu.com/project/orl/.
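The core idea of using image-level pre-trained features to establish object-level correspondence for contrastive learning can be sketched in generic form. The snippet below is an illustrative assumption, not ORL's actual pipeline: the matching rule (cosine nearest neighbour) and the InfoNCE loss are standard stand-ins, and the random tensors stand in for region-crop features produced by an image-level pre-trained backbone.

```python
import torch
import torch.nn.functional as F


def match_regions(feat_a, feat_b):
    """Nearest-neighbour matching between two sets of region features.

    feat_a: (N, D) features of region crops from one scene image.
    feat_b: (M, D) features of region crops from another scene image.
    Returns, for each row of feat_a, the index of its most similar row
    in feat_b under cosine similarity.
    """
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    sim = a @ b.t()  # (N, M) pairwise cosine similarities
    return sim.argmax(dim=1)


def object_level_infonce(feat_a, feat_b, match, temperature=0.1):
    """InfoNCE over matched region pairs.

    Each region in view A treats its matched region in view B as the
    positive and all other regions in B as negatives.
    """
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature  # (N, M) similarity logits
    return F.cross_entropy(logits, match)


# Toy usage: random "region features" stand in for crops encoded by an
# image-level self-supervised backbone (the prior that supplies the
# correspondence signal).
torch.manual_seed(0)
feat_a = torch.randn(8, 128)
feat_b = torch.randn(8, 128)
match = match_regions(feat_a, feat_b)
loss = object_level_infonce(feat_a, feat_b, match)
```

Pulling the correspondence from a frozen image-level encoder (rather than from labels or curated crops) is what lets this kind of objective extend contrastive learning from object-centric images to multi-object scenes.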