Large-scale deep learning excels when labeled images are abundant, yet data-efficient learning remains a longstanding challenge. While biological vision is thought to leverage vast amounts of unlabeled data to solve classification problems with limited supervision, computer vision has so far not succeeded in this `semi-supervised' regime. Our work tackles this challenge with Contrastive Predictive Coding, an unsupervised objective which extracts stable structure from still images. The result is a representation which, equipped with a simple linear classifier, separates ImageNet categories better than all competing methods, and surpasses the performance of a fully-supervised AlexNet model. When given a small number of labeled images (as few as 13 per class), this representation retains strong classification performance, outperforming state-of-the-art semi-supervised methods by 10% Top-5 accuracy and supervised methods by 20%. Finally, we find our unsupervised representation to serve as a useful substrate for image detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset. We expect these results to open the door to pipelines that use scalable unsupervised representations as a drop-in replacement for supervised ones for real-world vision tasks where labels are scarce.
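To make the contrastive objective concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss of the kind underlying Contrastive Predictive Coding: context features score their matching target embeddings against in-batch negatives via a bilinear product. All shapes, variable names, and the random-feature setup here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 8, 16

# Hypothetical stand-ins: context vectors (e.g. aggregated from seen patches)
# and target embeddings (e.g. of patches to be predicted).
context = rng.normal(size=(batch, dim))
targets = rng.normal(size=(batch, dim))
W = rng.normal(size=(dim, dim)) * 0.1  # bilinear prediction weights

# Pairwise scores: row i holds scores of context i against every target;
# the matching (positive) target sits on the diagonal.
logits = context @ W @ targets.T  # shape (batch, batch)

# Numerically stable log-softmax over each row, then cross-entropy
# against the diagonal positives.
m = logits.max(axis=1, keepdims=True)
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
loss = -np.mean(np.diag(log_probs))
```

Minimizing this loss pushes each context vector to assign higher similarity to its own target than to the other targets in the batch, which is the sense in which the objective "extracts stable structure" without labels.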