Detecting objects in a video is a compute-intensive task. In this paper we propose CaTDet, a system to speed up object detection by leveraging the temporal correlation in video. CaTDet consists of two DNN models that form a cascaded detector, and an additional tracker that predicts regions of interest based on historical detections. We also propose a new metric, mean Delay (mD), designed for latency-critical video applications. Experiments on the KITTI dataset show that CaTDet reduces operation count by 5.1-8.7x at the same mean Average Precision (mAP) as the single-model Faster R-CNN detector, while incurring an additional delay of only 0.3 frames. On the CityPersons dataset, CaTDet achieves a 13.0x reduction in operations with 0.8% mAP loss.
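A minimal sketch of the cascade control flow described above, assuming hypothetical `proposal_net`, `refinement_net`, and `tracker` interfaces (the paper's exact region merging and scheduling logic is not shown):

```python
# Sketch of a CaTDet-style cascade. The cheap proposal network and the
# tracker together nominate regions; only those regions reach the
# expensive refinement network, which is where the savings come from.

def merge_rois(a, b):
    # Naive union of region lists; a real system would deduplicate
    # heavily overlapping regions before refinement.
    return list(a) + list(b)

def catdet_frame(frame, proposal_net, refinement_net, tracker):
    predicted_rois = tracker.predict()        # extrapolated from history
    proposals = proposal_net(frame)           # cheap full-frame pass
    candidates = merge_rois(predicted_rois, proposals)
    detections = refinement_net(frame, candidates)  # accurate, ROI-only
    tracker.update(detections)                # feeds the next frame
    return detections
```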
Online multi-object tracking (MOT) is extremely important for high-level spatial reasoning and path planning in autonomous and highly automated vehicles. In this paper, we present a modular framework for tracking multiple objects (vehicles) that can accept object proposals from different sensor modalities (vision and range) and a variable number of sensors to produce continuous object tracks. This work generalizes the MDP framework for MOT with two key extensions. First, we track objects across multiple cameras and across different sensor modalities, by fusing object proposals across sensors accurately and efficiently. Second, the objects of interest (targets) are tracked directly in the real world. This is a departure from traditional techniques, where objects are simply tracked in the image plane; tracking in world coordinates allows the tracks to be used directly by an autonomous agent for navigation and related tasks. To verify the effectiveness of our approach, we test it on real-world highway data collected from a heavily sensorized testbed capable of capturing full-surround information. We demonstrate that our framework is well suited to tracking objects through entire maneuvers around the ego-vehicle, some of which take several minutes to complete. We also leverage the modularity of our approach by comparing the effects of including or excluding different sensors, of changing the total number of sensors, and of the quality of object proposals on the final tracking result.
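A hedged sketch of the two fusion steps implied above: projecting an image-plane detection onto the ground plane, then associating it with a range-sensor proposal. `H_ground` is an assumed image-to-ground homography from camera calibration; the gating and averaging are illustrative simplifications, not the paper's method.

```python
import numpy as np

def image_to_world(box, H_ground):
    # Use the midpoint of the box's bottom edge as the ground contact point.
    u = (box[0] + box[2]) / 2.0
    v = box[3]
    p = H_ground @ np.array([u, v, 1.0])
    return p[:2] / p[2]          # (x, y) in the ego-vehicle ground frame

def fuse(world_pts_cam, world_pts_range, gate=2.0):
    # Greedy nearest-neighbour association within a distance gate (metres).
    if len(world_pts_range) == 0:
        return np.asarray(world_pts_cam)
    fused = []
    for p in world_pts_cam:
        d = np.linalg.norm(world_pts_range - p, axis=1)
        j = int(np.argmin(d))
        fused.append((p + world_pts_range[j]) / 2 if d[j] < gate else p)
    return np.array(fused)
```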
There is growing interest in designing models that can deal with images from different visual domains. If a universal structure exists across visual domains and can be captured via a common parameterization, then we can use a single model for all domains rather than one model per domain. A model aware of the relationships between domains can also be trained to work on new domains with fewer resources. However, identifying the reusable structure in a model is not easy. In this paper, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach rests on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and incurs minimal overhead when applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains. We evaluate our approach on the Visual Decathlon Challenge, a benchmark for testing the ability of multi-domain models. The experiments show that our approach achieves the highest score while requiring only 50% of the parameters of state-of-the-art approaches.
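The sharing assumption maps naturally onto the two halves of a depthwise separable convolution. A minimal sketch (layer sizes are illustrative, and the gating mechanism is omitted):

```python
import torch.nn as nn

class MultiDomainSeparableConv(nn.Module):
    """Per-domain depthwise filters, one shared pointwise filter."""

    def __init__(self, channels, num_domains):
        super().__init__()
        # One depthwise conv per domain: domain-specific spatial correlations.
        self.depthwise = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
            for _ in range(num_domains))
        # A single pointwise conv: cross-channel correlations shared by all.
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x, domain):
        return self.pointwise(self.depthwise[domain](x))
```

Adding a new domain then costs only one extra depthwise filter bank, which is where the claimed compactness comes from.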
Over the past few years, Spiking Neural Networks (SNNs) have become popular as a possible pathway to enable low-power event-driven neuromorphic hardware. However, their application in machine learning has largely been limited to very shallow neural network architectures for simple problems. In this paper, we propose a novel algorithmic technique for generating an SNN with a deep architecture, and demonstrate its effectiveness on complex visual recognition problems such as CIFAR-10 and ImageNet. Our technique applies to both VGG and residual network architectures, with significantly better accuracy than the state-of-the-art. Finally, we present an analysis of the sparse event-driven computations to demonstrate reduced hardware overhead when operating in the spiking domain.
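For context, a hedged sketch of the standard conversion idea such techniques build on: a ReLU unit is replaced by an integrate-and-fire (IF) neuron whose firing rate over T timesteps approximates the ReLU activation. Parameters here are illustrative, not the paper's.

```python
import numpy as np

def if_layer(weighted_input, T=100, threshold=1.0):
    # weighted_input: float array of per-neuron input currents.
    v = np.zeros_like(weighted_input)        # membrane potentials
    spikes = np.zeros_like(weighted_input)   # accumulated spike counts
    for _ in range(T):
        v += weighted_input                  # integrate input current
        fired = v >= threshold
        spikes += fired
        v[fired] -= threshold                # reset by subtraction
    return spikes / T                        # firing rate ~ ReLU output
```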
Segmentation is a key stage in dermoscopic image processing, where the accuracy of the border that delineates skin lesions is of utmost importance for subsequent algorithms (e.g., classification) and computer-aided early diagnosis of serious medical conditions. This paper proposes a novel segmentation method based on Local Binary Patterns (LBP), in which LBP and K-Means clustering are combined to achieve a detailed delineation in dermoscopic images. In comparison with the usual dermatologist-like segmentation (i.e., the available ground truth), the proposed method is capable of finding more realistic borders of skin lesions, i.e., with much more detail. The results also exhibit reduced variability amongst different performance measures and are consistent across different images. The proposed method can also be applied to cell-based segmentation adapted to the specificities of lesion border growth, making it suitable for following the growth dynamics associated with the lesion border geometry in skin melanocytic images.
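A minimal sketch of the LBP + K-Means combination on a greyscale image; radius, point count, and the choice to append raw intensity as a feature are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.cluster import KMeans

def lbp_kmeans_segment(gray, radius=3, n_clusters=2):
    # Texture code per pixel via uniform LBP.
    n_points = 8 * radius
    lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
    # Cluster pixels by (texture, intensity) to split lesion from skin.
    features = np.stack([lbp.ravel(), gray.ravel()], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    return labels.reshape(gray.shape)
```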
Outdoor videos sometimes contain unexpected rain streaks due to rainy weather, which degrade subsequent computer vision applications such as video surveillance, object recognition, and tracking. In this paper, we propose a directional regularized tensor-based video deraining model that takes into consideration the arbitrary direction of rain streaks. In particular, the sparsity of rain streaks in the spatial and derivative domains and the spatiotemporal sparsity and low-rank property of the video background are incorporated into the proposed method. Unlike many previous methods that assume vertically falling rain streaks, we adopt the more realistic assumption that all rain streaks in a video fall in an approximately similar, arbitrary direction. The resulting complicated optimization problem is solved effectively via an alternating direction method. Comprehensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed deraining method.
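A hedged sketch of one generic alternating-direction update for a sparse (rain) plus low-rank (background) split; the paper's full model adds the directional and spatiotemporal regularizers, which are not shown here.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of the L1 norm: promotes sparsity.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svd_shrink(x, tau):
    # Singular-value thresholding: promotes low rank.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def decompose(video_matrix, lam=0.05, iters=50):
    # video_matrix: frames flattened into columns (pixels x frames).
    rain = np.zeros_like(video_matrix)
    for _ in range(iters):
        background = svd_shrink(video_matrix - rain, tau=1.0)
        rain = soft_threshold(video_matrix - background, lam)
    return background, rain
```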
High-quality dehazing performance depends heavily on the accurate estimation of the transmission map. In this work, a coarse estimate is first obtained by the weighted fusion of two different transmission maps, generated from the foreground and sky regions, respectively. A hybrid variational model with promoted regularization terms is then proposed to assist in refining the transmission map. The resulting complicated optimization problem is solved effectively via an alternating direction algorithm. The final haze-free image is then obtained from the refined transmission map and the atmospheric scattering model. Our dehazing framework preserves important image details while suppressing undesirable artifacts, even for hazy images with large sky regions. Experiments on both synthetic and realistic images illustrate that the proposed method is competitive with, or even outperforms, state-of-the-art dehazing techniques under different imaging conditions.
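The final recovery step follows from the standard atmospheric scattering model I = J·t + A·(1 − t): given the refined transmission map t and the airlight A, the haze-free image J is recovered per pixel. A minimal sketch (the transmission floor `t_min` is a common assumption to avoid noise amplification):

```python
import numpy as np

def recover(I, t, A, t_min=0.1):
    # I: hazy image (H, W, 3) in [0, 1]; t: transmission map (H, W);
    # A: estimated atmospheric light, length-3 vector.
    t = np.clip(t, t_min, 1.0)[..., None]   # clamp tiny transmissions
    J = (I - A) / t + A                      # invert the scattering model
    return np.clip(J, 0.0, 1.0)
```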
Vision-based person, hand, and face detection approaches have achieved remarkable success in recent years with the development of deep convolutional neural networks (CNNs). In this paper, we take the inherent correlation between the body and body parts into account and propose a new framework to boost the detection performance of multi-level objects. In particular, we adopt a region-based object detection structure with two carefully designed detectors that separately attend to the human body and body parts in a coarse-to-fine manner, which we call the Detector-in-Detector network (DID-Net). The first detector is designed to detect human bodies, hands, and faces. The second detector, based on the body detection results of the first, focuses on detecting small hands and faces inside each body region. The framework is trained end-to-end by optimizing a multi-task loss. Due to the lack of a dataset with joint human body, face, and hand annotations, we have collected and labeled a new large dataset named Human-Parts, with 14,962 images and 106,879 annotations. Experiments show that our method achieves excellent performance on Human-Parts.
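A simplified, crop-based rendering of the coarse-to-fine flow; `body_detector` and `parts_detector` are hypothetical stand-ins (the trained network may share features rather than operate on literal crops):

```python
def detect_parts(image, body_detector, parts_detector):
    results = []
    for body_box in body_detector(image):
        x1, y1, x2, y2 = body_box
        crop = image[y1:y2, x1:x2]
        # The second detector searches for small faces/hands inside the body.
        for part in parts_detector(crop):
            px1, py1, px2, py2 = part.box
            # Map part coordinates back to the full-image frame.
            results.append(
                (part.label, (px1 + x1, py1 + y1, px2 + x1, py2 + y1)))
    return results
```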
Nuclear segmentation within Haematoxylin & Eosin stained histology images is a fundamental prerequisite in the digital pathology work-flow, because nuclear features can act as key diagnostic markers. The development of automated methods for nuclear segmentation enables the quantitative analysis of tens of thousands of nuclei within a whole-slide pathology image, opening up possibilities for further analysis of large-scale nuclear morphometry. However, automated nuclear segmentation faces a major challenge: there are several different types of nuclei, some of which, such as tumour cells, exhibit large intra-class variability, and some of which are often clustered together. To address these challenges, we present a novel convolutional neural network for automated nuclear segmentation that leverages the instance-rich information encoded within the vertical and horizontal distances of nuclear pixels to their centres of mass. These distances are then utilised to separate clustered nuclei, resulting in accurate segmentation, particularly in areas with overlapping instances. We demonstrate state-of-the-art performance compared to other methods on four independent multi-tissue histology image datasets. Furthermore, we propose an interpretable and reliable evaluation framework that effectively quantifies nuclear segmentation performance and overcomes the limitations of existing performance measures.
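A sketch of the distance-map targets described above: for each nucleus instance, the horizontal and vertical offsets of its pixels from the instance's centre of mass, normalised per instance. Gradients of these maps are largest at borders between touching nuclei, which is what makes them useful for separation. Per-instance normalisation to [-1, 1] is an assumption of this sketch.

```python
import numpy as np

def hv_maps(instance_mask):
    # instance_mask: integer map, 0 = background, k > 0 = nucleus id k.
    h_map = np.zeros(instance_mask.shape, dtype=np.float32)
    v_map = np.zeros(instance_mask.shape, dtype=np.float32)
    for inst_id in np.unique(instance_mask):
        if inst_id == 0:
            continue
        ys, xs = np.nonzero(instance_mask == inst_id)
        cx, cy = xs.mean(), ys.mean()          # centre of mass
        h = (xs - cx) / max(np.abs(xs - cx).max(), 1e-6)
        v = (ys - cy) / max(np.abs(ys - cy).max(), 1e-6)
        h_map[ys, xs], v_map[ys, xs] = h, v
    return h_map, v_map
```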
In this paper we present a self-supervised method for representation learning that utilizes two different modalities. Based on the observation that cross-modal information carries high semantic meaning, we propose a method to effectively exploit this signal. For our approach we utilize video data, since it is available at large scale and provides easily accessible modalities in the form of RGB and optical flow. We demonstrate state-of-the-art performance on highly competitive action recognition datasets in the context of self-supervised learning. We show that our feature representation also transfers to other tasks, and we conduct extensive ablation studies to validate our core contributions. Code and models can be found at https://github.com/nawidsayed/Cross-and-Learn.
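A hedged sketch of a generic cross-modal objective in this spirit: features of the RGB and flow streams from the same clip should agree, while features from different clips should not. This is an illustrative contrastive form, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(rgb_feat, flow_feat, temperature=0.1):
    # rgb_feat, flow_feat: (batch, dim) features from the two streams.
    rgb = F.normalize(rgb_feat, dim=1)
    flow = F.normalize(flow_feat, dim=1)
    logits = rgb @ flow.t() / temperature   # pairwise similarities
    targets = torch.arange(rgb.size(0))     # positives on the diagonal
    return F.cross_entropy(logits, targets)
```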
Face de-identification algorithms have been developed in response to the prevalent use of public video recordings and surveillance cameras. Here, we evaluated the success of identity masking in the context of monitoring drivers as they actively operate a motor vehicle. We compared the effectiveness of eight de-identification algorithms using human perceivers. The algorithms we tested included the personalized supervised bilinear regression method for Facial Action Transfer (FAT); the DMask method, which renders a generic avatar face; and two edge-detection methods (Canny, Scharr), each implemented with and without image polarity inversion. We also used an Overmask approach that combined the FAT and Canny methods. We compared these identity-masking methods to identification from an unmasked video of the driver. Human subjects were tested in a standard face recognition experiment in which they learned driver identities from a high-resolution (studio-style) image and were subsequently tested on their ability to recognize masked and unmasked videos of these individuals driving. All masking methods lowered identification accuracy substantially relative to the unmasked video. The most successful methods, DMask and Canny, lowered human identification performance to near random. In all cases, identifications were made with stringent decision criteria, indicating that the subjects had low confidence in their decisions. We conclude that carefully tested de-identification approaches, used alone or in combination, can be an effective tool for protecting the privacy of individuals captured in videos. Future work should examine how the most effective methods fare in preserving facial action recognition.
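For the edge-detection conditions, a minimal sketch of the masking operation using OpenCV's Canny detector, with and without polarity inversion; the thresholds are illustrative, not those used in the study.

```python
import cv2

def canny_mask(frame, invert=False):
    # Replace the frame with its edge map: identity cues are largely
    # removed while facial motion remains visible.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)   # white edges on black background
    return cv2.bitwise_not(edges) if invert else edges
```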
Deep generative models like variational autoencoders approximate the intrinsic geometry of high-dimensional data manifolds by learning low-dimensional latent-space variables and an embedding function. The geometric properties of these latent spaces have been studied under the lens of Riemannian geometry, via analysis of the non-linearity of the generator function. In new developments, deep generative models have been used for learning semantically meaningful 'disentangled' representations that capture task-relevant attributes while being invariant to other attributes. In this work, we explore the geometry of popular generative models for disentangled representation learning. We use several metrics to compare the properties of the latent spaces of disentangled representation models in terms of class separability and latent-space curvature. Our results establish that the class-distinguishing features in the disentangled latent space exhibit higher curvature than in a variational autoencoder. We evaluate and compare the geometry of three such models with a variational autoencoder on two different datasets. Further, our results show that distances and interpolation in the latent space are significantly improved by Riemannian metrics derived from the curvature of the space. We expect these results to have implications for understanding how deep networks can be made more robust and generalizable, as well as more interpretable.
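A sketch of the standard construction behind such analyses: the decoder g pulls back the ambient Euclidean metric to a latent-space Riemannian metric G(z) = J(z)ᵀJ(z), where J is the Jacobian of g. Here the Jacobian is estimated by finite differences for illustration; in practice it would come from automatic differentiation.

```python
import numpy as np

def pullback_metric(g, z, eps=1e-4):
    # g: decoder mapping latent vector z to a data-space vector.
    x0 = g(z)
    J = np.stack([(g(z + eps * e) - x0) / eps
                  for e in np.eye(len(z))], axis=1)   # (dim_x, dim_z)
    return J.T @ J                                    # G(z), (dim_z, dim_z)
```

Curve lengths measured under G(z), rather than straight-line Euclidean distances, give the improved latent-space distances and interpolations referred to above.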
While modern convolutional neural networks achieve outstanding accuracy on many image classification tasks, they are, compared to humans, much more sensitive to image degradation. Here, we describe a variant of Batch Normalization, LocalNorm, that regularizes the normalization layer in the spirit of Dropout while dynamically adapting to the local image intensity and contrast at test time. We show that the resulting deep neural networks are much more resistant to noise-induced image degradation, improving accuracy by up to three times, while achieving the same or slightly better accuracy on non-degraded classical benchmarks. In computational terms, LocalNorm adds negligible training cost and little or no cost at inference time, and can be applied to already-trained networks in a straightforward manner.
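A hedged sketch of the core idea as we read it: rather than freezing training-set running statistics, normalization statistics are recomputed over small local groups of samples, letting the layer adapt to the intensity and contrast of incoming images. The group size and grouping scheme here are assumptions, not the paper's exact recipe.

```python
import torch

def local_norm(x, group_size=8, eps=1e-5):
    # x: (batch, channels, H, W); each group of samples is normalised
    # with its own per-channel statistics instead of global running stats.
    out = torch.empty_like(x)
    for start in range(0, x.size(0), group_size):
        g = x[start:start + group_size]
        mean = g.mean(dim=(0, 2, 3), keepdim=True)
        var = g.var(dim=(0, 2, 3), keepdim=True)
        out[start:start + group_size] = (g - mean) / torch.sqrt(var + eps)
    return out
```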
Environmental air quality affects people's lives, and obtaining real-time, accurate air quality information provides important guidance for social activities. At present, air quality is typically measured by detectors installed at fixed monitoring points in cities, with periodic sampling and analysis, an approach constrained by time and space. Existing deep-learning-based air quality measurement algorithms mostly train a single convolutional neural network on the whole image, ignoring the differences between different parts of the image. In this paper, we propose an air quality measurement method based on double-channel convolutional neural network ensemble learning to address feature extraction from different parts of environmental images. Our method comprises two components: ensemble learning with a double-channel convolutional neural network, and self-learning weighted feature fusion. We construct a double-channel convolutional neural network in which each channel is trained on a different part of the environmental images for feature extraction, and we propose a feature-weight self-learning method that weights and concatenates the extracted feature vectors, using the fused vectors to measure air quality. Our method applies to both air quality grade measurement and air quality index (AQI) measurement. Moreover, we build an environmental image dataset captured at random times and locations. Experiments show that our method achieves nearly 82% average accuracy and a small average absolute error on our test set. Contrast experiments further show a considerable performance increase over single-channel convolutional neural network air quality measurement.
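A minimal sketch of the self-learning weighted fusion described above: learnable scalar weights scale the two channels' feature vectors before concatenation, and a shared head produces the measurement. Layer sizes and the softmax parameterization of the weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # self-learned fusion weights
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, feat_dim) features from the two channels.
        weights = torch.softmax(self.w, dim=0)
        fused = torch.cat([weights[0] * feat_a,
                           weights[1] * feat_b], dim=1)
        return self.head(fused)
```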