Fine-tuning CNN Image Retrieval with No Human Annotation

Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
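The abstract describes GeM pooling only as a layer that generalizes max and average pooling. A minimal sketch of that idea follows, assuming the usual generalized (power) mean over the spatial locations of each feature map: with exponent p = 1 it reduces to average pooling, and as p grows it approaches max pooling. The function name `gem_pool`, the `(C, H, W)` layout, the fixed value of `p`, and the `eps` clamp are illustrative assumptions; in the paper the exponent is trainable, which this NumPy sketch does not attempt to reproduce.

```python
import numpy as np


def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial dims of a (C, H, W) map.

    Assumes non-negative activations (e.g. after a ReLU); values are
    clamped to at least `eps` so the power is well defined.
    Returns one pooled value per channel, shape (C,).
    """
    x = np.clip(np.asarray(features, dtype=np.float64), eps, None)
    # Power mean: (mean of x^p) ^ (1/p), computed per channel.
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)


# Hypothetical activation map: 4 channels on a 5x5 spatial grid.
rng = np.random.default_rng(0)
fmap = rng.random((4, 5, 5))

avg_like = gem_pool(fmap, p=1.0)    # matches average pooling
max_like = gem_pool(fmap, p=100.0)  # close to max pooling
```

In a trainable layer, `p` would be a parameter updated by backpropagation along with the network weights; here it is a plain argument so that the two limiting cases can be checked directly.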
