Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of the object-level visual features. In this paper, we propose a novel and efficient deep framework to boost multi-label classification by distilling knowledge from a weakly-supervised detection task without bounding box annotations. Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module, in which the WSD model guides the classification model according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved and the efficiency is maintained since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework outperforms state-of-the-art methods in both performance and efficiency.
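
The teacher-student setup described above can be illustrated with a minimal sketch. The abstract does not give the loss, so everything here is an assumption: a per-class binary cross-entropy against the ground-truth labels, a second cross-entropy against the teacher's class-level probabilities, and a mean squared error between student and teacher feature vectors. The function names and the weights `alpha` and `beta` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy between a target probability and a prediction."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_logits, teacher_logits, labels,
                      student_feat, teacher_feat, alpha=0.5, beta=0.1):
    """Illustrative multi-label distillation loss (not the paper's exact form).

    hard: student vs. image-level labels
    soft: student vs. teacher class-level predictions
    feat: student vs. teacher feature vectors (mean squared error)
    alpha, beta: hypothetical weights for the two distillation terms
    """
    n = len(labels)
    s = [sigmoid(z) for z in student_logits]
    t = [sigmoid(z) for z in teacher_logits]
    hard = sum(bce(y, p) for y, p in zip(labels, s)) / n
    soft = sum(bce(q, p) for q, p in zip(t, s)) / n
    feat = sum((a - b) ** 2
               for a, b in zip(student_feat, teacher_feat)) / len(student_feat)
    return hard + alpha * soft + beta * feat
```

Only the student is evaluated at test time in this sketch, which mirrors the abstract's point that the teacher can be discarded after training.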

A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, and naturally cover "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at https://github.com/locuslab/projected_sinkhorn.
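
For readers unfamiliar with the Sinkhorn iteration mentioned above, the following is a minimal sketch of the standard (unmodified) version, which approximates an entropy-regularized optimal transport plan between two histograms. It is not the paper's projection procedure; the function names, the regularization strength `reg`, and the iteration count are assumptions chosen for illustration.

```python
import math


def sinkhorn_plan(a, b, cost, reg=0.5, n_iters=500):
    """Standard Sinkhorn iteration for entropy-regularized transport.

    a, b: source and target histograms (each sums to 1)
    cost: cost[i][j] is the cost of moving mass from bin i to bin j
    Returns a transport plan whose row sums approximate a and whose
    column sums match b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for prow, crow in zip(plan, cost)
               for p, c in zip(prow, crow))


if __name__ == "__main__":
    # Three "pixels" on a line; the cost is the distance between positions.
    a = [0.5, 0.3, 0.2]
    b = [0.4, 0.3, 0.3]
    cost = [[abs(i - j) for j in range(3)] for i in range(3)]
    plan = sinkhorn_plan(a, b, cost)
    print(transport_cost(plan, cost))
```

In this toy example the mass moved between neighbouring bins plays the role of the "pixel mass" in the abstract; smaller values of `reg` bring the cost closer to the exact Wasserstein distance but slow convergence.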

State-of-the-art face recognition algorithms are able to achieve good performance when sufficient training images are provided. Unfortunately, the number of facial images is limited in some real face recognition applications. In this paper, we propose ComplexFace, a novel and effective algorithm for face recognition with limited samples using complex number based data augmentation. The algorithm first generates new representations from original samples and then fuses both into complex numbers, which avoids the difficulty of weight setting in other fusion approaches. A test sample can then be expressed by the linear combination of all the training samples, which maps the sample to the new representation space for classification by the kernel function. The collaborative representation based classifier is then built to make predictions. Extensive experiments on the Georgia Tech (GT) face database and the ORL face database show that our algorithm significantly outperforms existing methods: the average errors of previous approaches, ranging from 31.66% to 41.75%, are reduced to 14.54% on the GT database; the average errors of previous approaches, ranging from 5.21% to 10.99%, are reduced to 1.67% on the ORL database. In other words, our algorithm decreases the average error by up to 84.80% in relative terms on the ORL database.
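
The 84.80% figure is a relative reduction, which can be checked from the error rates quoted above. The short sketch below reproduces it from the worst previous ORL error (10.99%) and applies the same formula to the worst previous GT error (41.75%); the helper name is hypothetical, and the GT figure is derived here rather than stated in the abstract.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Worst previous error vs. the error reported for ComplexFace.
orl = relative_error_reduction(10.99, 1.67)
gt = relative_error_reduction(41.75, 14.54)

print(f"ORL: {orl:.2f}%")  # ORL: 84.80%
print(f"GT: {gt:.2f}%")    # GT: 65.17%
```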

Deep neural networks provide unprecedented performance in all image classification problems, leveraging the availability of huge amounts of data for training. Recent studies, however, have shown their vulnerability to adversarial attacks, spawning an intense research effort in this field. With the aim of building better systems, new countermeasures and stronger attacks are proposed by the day. On the attacker's side, there is growing interest in the realistic black-box scenario, in which the attacker has no access to the neural network parameters. The problem is to design limited-complexity attacks which mislead the neural network without impairing image quality too much, so as not to attract the attention of human observers. In this work, we put special emphasis on this latter requirement and propose a powerful and low-complexity black-box attack which preserves perceptual image quality. Numerical experiments demonstrate the effectiveness of the proposed techniques both for tasks commonly considered in this context, and for other applications in biometrics (face recognition) and forensics (camera model identification).

Supervised deep learning relies on the assumption that enough training data is available, which presents a problem for its application to several fields, like medical imaging. Using the example of a binary image classification task (breast cancer recognition), we show that pretraining a generative model for meaningful image augmentation helps enhance the performance of the resulting classifier. By augmenting the data, performance on downstream classification tasks can be improved even with a relatively small training set. We show that this "adversarial augmentation" yields promising results compared to classical image augmentation for breast cancer classification.

Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be interpreted as forward Euler discretizations of a nonlinear initial value problem whose time-dependent control variables represent the weights of the neural network. Hence, training a ResNet can be cast as an optimal control problem of the associated dynamical system. For similar time-dependent optimal control problems arising in engineering applications, parallel-in-time methods have shown notable improvements in scalability. This paper demonstrates the use of those techniques for efficient and effective training of ResNets. The proposed algorithms replace the classical (sequential) forward and backward propagation through the network layers by a parallel nonlinear multigrid iteration applied to the layer domain. This adds a new dimension of parallelism across layers that is attractive when training very deep networks. From this basic idea, we derive multiple layer-parallel methods. The most efficient version employs a simultaneous optimization approach where updates to the network parameters are based on inexact gradient information in order to speed up the training process. Using numerical examples from supervised classification, we demonstrate that the new approach achieves similar training performance to traditional methods, but enables layer-parallelism and thus provides speedup over layer-serial methods through greater concurrency.
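
The forward Euler interpretation can be made concrete: each residual block computes x_{k+1} = x_k + h * f(x_k, theta_k), where h is a step size and theta_k are the weights of layer k. The sketch below shows only the classical sequential forward propagation that the layer-parallel methods replace, not the multigrid scheme itself. The tanh layer function and all names are assumptions made for illustration.

```python
import math


def residual_layer(x, weight, bias):
    """Layer function f(x, theta) = tanh(W x + b); the form is an assumption."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def resnet_forward(x0, params, h):
    """Sequential forward propagation as forward Euler time stepping.

    x0: initial state (the network input)
    params: list of (weight, bias) pairs, one per layer, i.e. per time step
    h: step size of the discretization
    """
    x = list(x0)
    for weight, bias in params:
        fx = residual_layer(x, weight, bias)
        # One Euler step: x_{k+1} = x_k + h * f(x_k, theta_k)
        x = [xi + h * fi for xi, fi in zip(x, fx)]
    return x
```

Because each step depends on the previous state, this loop is inherently serial across layers, which is the bottleneck the parallel-in-time approach targets.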

Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, which has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact representations and an excessive number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAU). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus eliminating the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscapes) and blind image de-blurring (GOPRO). Results show that DAUs efficiently allocate parameters, resulting in up to four times more compact networks at similar or better performance.

While modern convolutional neural networks achieve outstanding accuracy on many image classification tasks, they are, compared to humans, much more sensitive to image degradation. Here, we describe a variant of Batch Normalization, LocalNorm, that regularizes the normalization layer in the spirit of Dropout while dynamically adapting to the local image intensity and contrast at test-time. We show that the resulting deep neural networks are much more resistant to noise-induced image degradation, improving accuracy by up to three times, while achieving the same or slightly better accuracy on non-degraded classical benchmarks. In computational terms, LocalNorm adds negligible training cost and little or no cost at inference time, and can be applied to already-trained networks in a straightforward manner.

Hyperspectral image (HSI) classification is widely used for the analysis of remotely sensed images. Hyperspectral imagery includes varying bands of images. The convolutional neural network (CNN) is one of the most frequently used deep learning based methods for visual data processing. The use of CNNs for HSI classification is also visible in recent works. These approaches are mostly based on 2D CNNs. However, HSI classification performance is highly dependent on both spatial and spectral information. Very few methods have utilized the 3D CNN because of its increased computational complexity. This letter proposes a Hybrid Spectral Convolutional Neural Network (HybridSN) for HSI classification. Basically, the HybridSN is a spectral-spatial 3D-CNN followed by a spatial 2D-CNN. The 3D-CNN facilitates the joint spatial-spectral feature representation from a stack of spectral bands. The 2D-CNN on top of the 3D-CNN further learns a more abstract level of spatial representation. Moreover, the use of hybrid CNNs reduces the complexity of the model compared to a 3D-CNN alone. To test the performance of this hybrid approach, very rigorous HSI classification experiments are performed over the Indian Pines, Pavia University and Salinas Scene remote sensing datasets. The results are compared with state-of-the-art hand-crafted as well as end-to-end deep learning based methods. A very satisfactory performance is obtained using the proposed HybridSN for HSI classification. The source code can be found at https://github.com/gokriznastic/HybridSN.
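
To illustrate how a 3D-CNN stage hands over to a 2D-CNN stage, the sketch below traces tensor shapes through a few unpadded, stride-1 convolutions. The abstract gives no layer sizes, so the patch size, kernel sizes, and filter counts in the example are illustrative assumptions, and the helper names are hypothetical. The key step is the reshape, which folds the remaining spectral depth into the channel axis so that a 2D convolution can follow.

```python
def conv_out(size, kernel):
    """Output length of a stride-1, unpadded ('valid') convolution."""
    return size - kernel + 1


def hybrid_shapes(height, width, bands, kernels_3d, filters_3d,
                  kernel_2d, filters_2d):
    """Trace (name, shape) pairs through 3D convs, a reshape, and one 2D conv."""
    h, w, d = height, width, bands
    shapes = []
    for (kh, kw, kd), f in zip(kernels_3d, filters_3d):
        h, w, d = conv_out(h, kh), conv_out(w, kw), conv_out(d, kd)
        shapes.append(("conv3d", (h, w, d, f)))
    # Fold the spectral depth into the channel axis for the 2D stage.
    channels = d * filters_3d[-1]
    shapes.append(("reshape", (h, w, channels)))
    h, w = conv_out(h, kernel_2d), conv_out(w, kernel_2d)
    shapes.append(("conv2d", (h, w, filters_2d)))
    return shapes


if __name__ == "__main__":
    # Illustrative sizes only: a 25x25 patch with 30 spectral bands.
    for name, shape in hybrid_shapes(
            25, 25, 30,
            kernels_3d=[(3, 3, 7), (3, 3, 5), (3, 3, 3)],
            filters_3d=[8, 16, 32],
            kernel_2d=3, filters_2d=64):
        print(name, shape)
```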

Deep neural networks (DNNs) have produced state-of-the-art results in many benchmarks and problem domains. However, the success of DNNs depends on the proper configuration of their architecture and hyperparameters. Such a configuration is difficult and, as a result, DNNs are often not used to their full potential. In addition, DNNs in commercial applications often need to satisfy real-world design constraints such as size or number of parameters. To make configuration easier, automatic machine learning (AutoML) systems for deep learning have been developed, focusing mostly on optimization of hyperparameters. This paper takes AutoML a step further. It introduces an evolutionary AutoML framework called LEAF that optimizes not only hyperparameters but also network architectures and the size of the network. LEAF makes use of both state-of-the-art evolutionary algorithms (EAs) and distributed computing frameworks. Experimental results on medical image classification and natural language analysis show that the framework can be used to achieve state-of-the-art performance. In particular, LEAF demonstrates that architecture optimization provides a significant boost over hyperparameter optimization, and that networks can be minimized at the same time with little drop in performance. LEAF therefore forms a foundation for democratizing and improving AI, as well as making AI practical in future applications.

Recent studies on multi-label image classification have focused on designing more complex architectures of deep neural networks, such as the use of attention mechanisms and region proposal networks. Although performance gains have been reported, the backbone deep models of the proposed approaches and the evaluation metrics employed in different works vary, making it difficult to compare them fairly. Moreover, due to the lack of properly investigated baselines, the advantages introduced by the proposed techniques are often ambiguous. To address these issues, we make a thorough investigation of the mainstream deep convolutional neural network architectures for multi-label image classification and present a strong baseline. With the use of proper data augmentation techniques and model ensembles, the basic deep architectures can achieve better performance than many existing more complex ones on three benchmark datasets, providing great insight for future studies on multi-label image classification.

Glioblastoma multiforme (GBM) is a kind of brain tumor with an extraordinarily complex treatment process. The survival period is typically 14-16 months, and the 2-year survival rate is approximately 26%-33%. The clinical treatment strategies for the pseudoprogression (PsP) and true tumor progression (TTP) of GBM are different, so accurately distinguishing these two conditions is particularly significant. As PsP and TTP of GBM are similar in shape and other characteristics, it is hard to distinguish these two forms with precision. In order to differentiate them accurately, this paper introduces a feature learning method based on a generative adversarial network: DC-Al GAN. A GAN consists of two networks: a generator and a discriminator. AlexNet is used as the discriminator in this work. Owing to the adversarial and competitive relationship between generator and discriminator, the latter extracts highly concise features during training. In DC-Al GAN, features are extracted from AlexNet in the final classification phase, and their highly concise nature contributes positively to the classification accuracy. The generator in DC-Al GAN is adapted from the deep convolutional generative adversarial network (DCGAN) by adding three convolutional layers. This effectively generates higher resolution sample images. Feature fusion is used to combine high layer features with low layer features, allowing for the creation and use of more precise features for classification. The experimental results confirm that DC-Al GAN achieves high accuracy on GBM datasets for PsP and TTP image classification, which is superior to other state-of-the-art methods.

Compared with the enormous research achievements targeting better image classification models, efforts applied to object detector training are dwarfed in terms of popularity and universality. Due to significantly more complex network structures and optimization targets, various training strategies and pipelines are designed specifically for certain detection algorithms and no others. In this work, we explore universal tweaks that help boost the performance of state-of-the-art object detection models to a new level without sacrificing inference speed. Our experiments indicate that these freebies can yield as much as a 5% absolute precision increase, so everyone should consider applying them to object detection training to a certain degree.

Deep learning has been at the foundation of large improvements in image classification. To improve the robustness of predictions, Bayesian approximations have been used to learn parameters in deep neural networks. We follow an alternative approach, using Gaussian processes as building blocks for Bayesian deep learning models, which has recently become viable due to advances in inference for convolutional and deep structures. We investigate deep convolutional Gaussian processes, and identify a problem that holds back current performance. To remedy the issue, we introduce a translation insensitive convolutional kernel, which removes the restriction of requiring identical outputs for identical patch inputs. We show empirically that this convolutional kernel improves performance in both shallow and deep models. On MNIST, FASHION-MNIST and CIFAR-10 we improve on previous GP models in terms of accuracy, with the added benefit of better calibrated predictive probabilities than simple DNN models.
