Image-level feature descriptors obtained from convolutional neural networks have shown powerful representation capabilities for image retrieval. In this paper, we present an unsupervised method to aggregate deep convolutional features into compact yet discriminative image vectors by simulating the dynamics of heat diffusion. A distinctive problem in image retrieval is that repetitive or bursty features tend to dominate feature representations, leading to less than ideal matches. We show that by leveraging elegant properties of the heat equation, our method is able to select informative features while avoiding over-representation of bursty features. We additionally present a theoretical time complexity analysis showing the efficiency of our method, which is further demonstrated in our experimental evaluation. Finally, we extensively evaluate the proposed approach with pre-trained and fine-tuned deep networks on common public benchmarks, and show superior performance compared to previous work.

### 更多

Fast and scalable Content-Based Image Retrieval using visual features is required for document analysis, Medical image analysis, etc. in the present age. Convolutional Neural Network (CNN) activations as features achieved their outstanding performance in this area. Deep Convolutional representations using the softmax function in the output layer are also ones among visual features. However, almost all the image retrieval systems hold their index of visual features on main memory in order to high responsiveness, limiting their applicability for big data applications. In this paper, we propose a fast calculation method of cosine similarity with L2 norm indexed in advance on Elasticsearch. We evaluate our approach with ImageNet Dataset and VGG-16 pre-trained model. The evaluation results show the effectiveness and efficiency of our proposed method.

Recent studies in image retrieval task have shown that ensembling different models and combining multiple global descriptors lead to performance improvement. However, training different models for ensemble is not only difficult but also inefficient with respect to time or memory. In this paper, we propose a novel framework that exploits multiple global descriptors to get an ensemble-like effect while it can be trained in an end-to-end manner. The proposed framework is flexible and expandable by the global descriptor, CNN backbone, loss, and dataset. Moreover, we investigate the effectiveness of combining multiple global descriptors with quantitative and qualitative analysis. Our extensive experiments show that the combined descriptor outperforms a single global descriptor, as it can utilize different types of feature properties. In the benchmark evaluation, the proposed framework achieves the state-of-the-art performance on the CARS196, CUB200-2011, In-shop Clothes and Stanford Online Products on image retrieval tasks by a large margin compared to competing approaches. Our model implementations and pretrained models are publicly available.

Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $94k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we further introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute \emph{Word2VisualVec}, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.

Image-level feature descriptors obtained from convolutional neural networks have shown powerful representation capabilities for image retrieval. In this paper, we present an unsupervised method to aggregate deep convolutional features into compact yet discriminative image vectors by simulating the dynamics of heat diffusion. A distinctive problem in image retrieval is that repetitive or bursty features tend to dominate feature representations, leading to less than ideal matches. We show that by leveraging elegant properties of the heat equation, our method is able to avoiding over-representation of bursty features. We additionally present a theoretical time complexity analysis showing the efficiency of our method, which is further demonstrated in our experimental evaluation. Finally, we extensively evaluate the proposed approach with pre-trained and fine-tuned deep networks on common public benchmarks, and show superior performance compared to previous work.

In this work we introduce a cross modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus the attention on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs the best in both single and multiple object image retrieval in standard datasets.

In this paper, we propose a simple but effective semantic-based aggregation (SBA) method. The proposed SBA utilizes the discriminative filters of deep convolutional layers as semantic detectors. Moreover, we propose the effective unsupervised strategy to select some semantic detectors to generate the "probabilistic proposals", which highlight certain discriminative pattern of objects and suppress the noise of background. The final global SBA representation could then be acquired by aggregating the regional representations weighted by the selected "probabilistic proposals" corresponding to various semantic content. Our unsupervised SBA is easy to generalize and achieves excellent performance on various tasks. We conduct comprehensive experiments and show that our unsupervised SBA outperforms the state-of-the-art unsupervised and supervised aggregation methods on image retrieval, place recognition and cloud classification.

Person Re-Identification (ReID) requires comparing two images of person captured under different conditions. Existing work based on neural networks often computes the similarity of feature maps from one single convolutional layer. In this work, we propose an efficient, end-to-end fully convolutional Siamese network that computes the similarities at multiple levels. We demonstrate that multi-level similarity can improve the accuracy considerably using low-complexity network structures in ReID problem. Specifically, first, we use several convolutional layers to extract the features of two input images. Then, we propose Convolution Similarity Network to compute the similarity score maps for the inputs. We use spatial transformer networks (STNs) to determine spatial attention. We propose to apply efficient depth-wise convolution to compute the similarity. The proposed Convolution Similarity Networks can be inserted into different convolutional layers to extract visual similarities at different levels. Furthermore, we use an improved ranking loss to further improve the performance. Our work is the first to propose to compute visual similarities at low, middle and high levels for ReID. With extensive experiments and analysis, we demonstrate that our system, compact yet effective, can achieve competitive results with much smaller model size and computational complexity.

Hash coding has been widely used in the approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and shown largely improved performance over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-assignment way. That is, the pairwise similarity is '1' if they share no less than one class label and '0' if they do not share any. However, such similarity definition cannot reflect the similarity ranking for pairwise images that hold multiple labels. In this paper, a new deep hashing method is proposed for multi-label image retrieval by re-defining the pairwise similarity into an instance similarity, where the instance similarity is quantified into a percentage based on the normalized semantic labels. Based on the instance similarity, a weighted cross-entropy loss and a minimum mean square error loss are tailored for loss-function construction, and are efficiently used for simultaneous feature learning and hash coding. Experiments on three popular datasets demonstrate that, the proposed method outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.

We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, we also propose an attention mechanism for keypoint selection, which shares most network layers with the descriptor. This framework can be used for image retrieval as a drop-in replacement for other keypoint detectors and descriptors, enabling more accurate feature matching and geometric verification. Our system produces reliable confidence scores to reject false positives---in particular, it is robust against queries that have no correct match in the database. To evaluate the proposed descriptor, we introduce a new large-scale dataset, referred to as Google-Landmarks dataset, which involves challenges in both database and query such as background clutter, partial occlusion, multiple landmarks, objects in variable scales, etc. We show that DELF outperforms the state-of-the-art global and local descriptors in the large-scale setting by significant margins. Code and dataset can be found at the project webpage: https://github.com/tensorflow/models/tree/master/research/delf .

HeeJae Jun,ByungSoo Ko,Youngjoon Kim,Insik Kim,Jongtack Kim
3+阅读 · 2019年4月18日
Marvin Teichmann,Andre Araujo,Menglong Zhu,Jack Sim
14+阅读 · 2018年12月4日
Jianfeng Dong,Xirong Li,Cees G. M. Snoek
5+阅读 · 2018年7月14日
Shanmin Pang,Jin Ma,Jianru Xue,Jihua Zhu,Vicente Ordonez
3+阅读 · 2018年5月25日
5+阅读 · 2018年4月28日
Jian Xu,Chunheng Wang,Chengzuo Qi,Cunzhao Shi,Baihua Xiao
8+阅读 · 2018年4月3日
Yiluan Guo,Ngai-Man Cheung
4+阅读 · 2018年4月2日
Zheng Zhang,Qin Zou,Qian Wang,Yuewei Lin,Qingquan Li
5+阅读 · 2018年3月19日
Hyeonwoo Noh,Andre Araujo,Jack Sim,Tobias Weyand,Bohyung Han
3+阅读 · 2018年2月3日

47+阅读 · 2020年5月31日

20+阅读 · 2019年10月17日

20+阅读 · 2019年10月11日

CreateAMind
9+阅读 · 2019年5月22日
CreateAMind
26+阅读 · 2019年1月3日
CreateAMind
7+阅读 · 2018年12月10日

16+阅读 · 2018年9月24日
CreateAMind
15+阅读 · 2018年5月25日

16+阅读 · 2018年3月16日
CreateAMind
6+阅读 · 2018年2月7日
CreateAMind
4+阅读 · 2017年10月31日
Top