Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods found the region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of the object-level visual features. In this paper, we propose a novel and efficient deep framework to boost multi-label classification by distilling knowledge from weakly-supervised detection task without bounding box annotations. Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module that guides the classification model by the WSD model according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved and the efficiency is maintained since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework achieves superior performances over the state-of-the-art methods on both performance and efficiency.
A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally cover "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at https://github.com/locuslab/projected_sinkhorn.
State-of-the-art face recognition algorithms are able to achieve good performance when sufficient training images are provided. Unfortunately, the number of facial images is limited in some real face recognition applications. In this paper, we propose ComplexFace, a novel and effective algorithm for face recognition with limited samples using complex number based data augmentation. The algorithm first generates new representations from original samples and then fuse both into complex numbers, which avoids the difficulty of weight setting in other fusion approaches. A test sample can then be expressed by the linear combination of all the training samples, which mapped the sample to the new representation space for classification by the kernel function. The collaborative representation based classifier is then built to make predictions. Extensive experiments on the Georgia Tech (GT) face database and the ORL face database show that our algorithm significantly outperforms existing methods: the average errors of previous approaches ranging from 31.66% to 41.75% are reduced to 14.54% over the GT database; the average errors of previous approaches ranging from 5.21% to 10.99% are reduced to 1.67% over the ORL database. In other words, our algorithm has decreased the average errors by up to 84.80% on the ORL database.
Deep neural networks provide unprecedented performance in all image classification problems, leveraging the availability of huge amounts of data for training. Recent studies, however, have shown their vulnerability to adversarial attacks, spawning an intense research effort in this field. With the aim of building better systems, new countermeasures and stronger attacks are proposed by the day. On the attacker's side, there is growing interest for the realistic black-box scenario, in which the user has no access to the neural network parameters. The problem is to design limited-complexity attacks which mislead the neural network without impairing image quality too much, not to raise the attention of human observers. In this work, we put special emphasis on this latter requirement and propose a powerful and low-complexity black-box attack which preserves perceptual image quality. Numerical experiments prove the effectiveness of the proposed techniques both for tasks commonly considered in this context, and for other applications in biometrics (face recognition) and forensics (camera model identification).
Supervised deep learning relies on the assumption that enough training data is available, which presents a problem for its application to several fields, like medical imaging. On the example of a binary image classification task (breast cancer recognition), we show that pretraining a generative model for meaningful image augmentation helps enhance the performance of the resulting classifier. By augmenting the data, performance on downstream classification tasks could be improved even with a relatively small training set. We show that this "adversarial augmentation" yields promising results compared to classical image augmentation on the example of breast cancer classification.
Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be interpreted as forward Euler discretizations of a nonlinear initial value problem whose time-dependent control variables represent the weights of the neural network. Hence, training a ResNet can be cast as an optimal control problem of the associated dynamical system. For similar time-dependent optimal control problems arising in engineering applications, parallel-in-time methods have shown notable improvements in scalability. This paper demonstrates the use of those techniques for efficient and effective training of ResNets. The proposed algorithms replace the classical (sequential) forward and backward propagation through the network layers by a parallel nonlinear multigrid iteration applied to the layer domain. This adds a new dimension of parallelism across layers that is attractive when training very deep networks. From this basic idea, we derive multiple layer-parallel methods. The most efficient version employs a simultaneous optimization approach where updates to the network parameters are based on inexact gradient information in order to speed up the training process. Using numerical examples from supervised classification, we demonstrate that the new approach achieves similar training performance to traditional methods, but enables layer-parallelism and thus provides speedup over layer-serial methods through greater concurrency.
Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, that has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact representations and excessive number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAU). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus eliminating the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscape) and blind image de-blurring (GOPRO). Results show that DAUs efficiently allocate parameters resulting in up to four times more compact networks at similar or better performance.