We develop a method for the automated detection and segmentation of speech balloons in comic books, including their carriers and tails. Our method is based on a deep convolutional neural network trained on annotated pages of the Graphic Narrative Corpus. More precisely, we use a fully convolutional network inspired by the U-Net architecture, combined with a VGG-16 based encoder. The trained model delivers state-of-the-art performance with an F1-score of over 0.94. Qualitative results suggest that wiggly tails, curved corners, and even illusory contours do not pose a major problem. Furthermore, the model has learned to distinguish speech balloons from captions. We compare our model to earlier results and discuss some possible applications.
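As a rough illustration of the architecture described above, the following minimal PyTorch sketch wires a U-Net-style decoder onto a VGG-16 encoder; the split points, channel widths, and layer names are our assumptions, not the authors' released model.

```python
# Minimal sketch: U-Net-style decoder on a VGG-16 encoder (assumed layout,
# not the authors' released code). Output: 1-channel balloon-mask logits.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16UNet(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        # Split the VGG-16 feature stack at each max-pool to expose skips.
        self.enc1 = feats[:4]     # 64 channels, full resolution
        self.enc2 = feats[4:9]    # 128 channels, 1/2
        self.enc3 = feats[9:16]   # 256 channels, 1/4
        self.enc4 = feats[16:23]  # 512 channels, 1/8
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = nn.Sequential(nn.Conv2d(512, 256, 3, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(256, 128, 3, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, 1, 1)  # per-pixel balloon logit

    def forward(self, x):
        s1 = self.enc1(x)
        s2 = self.enc2(s1)
        s3 = self.enc3(s2)
        s4 = self.enc4(s3)
        # Decoder: upsample and concatenate the matching encoder skip.
        d3 = self.dec3(torch.cat([self.up3(s4), s3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)
```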
Standard adversarial training involves two agents, namely a generator and a discriminator, playing a minimax game. However, even if the players converge to an equilibrium, the generator may only recover a part of the target data distribution, a situation commonly referred to as mode collapse. In this work, we present the Domain Partitioning Network (DoPaNet), a new approach to dealing with mode collapse in generative adversarial learning. We employ multiple discriminators, each encouraging the generator to cover a different part of the target distribution. To ensure these parts do not overlap and collapse into the same mode, we add a classifier as a third agent in the game. The classifier decides which discriminator the generator is trained against for each sample. Through experiments on toy examples and real images, we show the merits of DoPaNet in covering the real distribution and its superiority over competing methods. We also show that DoPaNet allows controlling the modes from which samples are generated.
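A minimal sketch of the resulting three-agent training step, as we read it from the abstract, might look as follows; the names `G`, `D_list`, and `Q`, the routing-by-argmax rule, and the pairing of real samples are all illustrative assumptions, and the classifier's own training objective is omitted.

```python
# Minimal sketch of a DoPaNet-style three-agent step (our reading of the
# abstract; interfaces and routing rule are assumptions, Q's update omitted).
import torch

def dopanet_step(G, D_list, Q, x_real, z, opt_g, opt_d):
    bce = torch.nn.functional.binary_cross_entropy_with_logits

    x_fake = G(z)
    # The classifier decides which discriminator judges each sample.
    k = Q(x_fake).argmax(dim=1)  # one discriminator index per sample

    # Update each discriminator on its routed share of real/fake samples.
    opt_d.zero_grad()
    d_loss = 0.0
    for i, D in enumerate(D_list):
        fake_i = x_fake[k == i].detach()
        if len(fake_i) == 0:
            continue
        real_i = x_real[: len(fake_i)]  # simplistic pairing for the sketch
        d_loss = d_loss + bce(D(real_i), torch.ones(len(real_i), 1)) \
                        + bce(D(fake_i), torch.zeros(len(fake_i), 1))
    d_loss.backward()
    opt_d.step()

    # Update the generator against its assigned discriminators only.
    opt_g.zero_grad()
    g_loss = 0.0
    for i, D in enumerate(D_list):
        fake_i = x_fake[k == i]
        if len(fake_i) > 0:
            g_loss = g_loss + bce(D(fake_i), torch.ones(len(fake_i), 1))
    g_loss.backward()
    opt_g.step()
```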
Point clouds obtained with 3D scanners or by image-based reconstruction techniques are often corrupted with significant amounts of noise and outliers. Traditional methods for point cloud denoising largely rely on local surface fitting (e.g., jets or MLS surfaces), local or non-local averaging, or on statistical assumptions about the underlying noise model. In contrast, we develop a simple data-driven method for removing outliers and reducing noise in unordered point clouds. We base our approach on a deep learning architecture adapted from PCPNet, which was recently proposed for estimating local 3D shape properties in point clouds. Our method first classifies and discards outlier samples, and then estimates correction vectors that project noisy points onto the original clean surfaces. The approach is efficient and robust to varying amounts of noise and outliers, while being able to handle large, densely sampled point clouds. In our extensive evaluation, on both synthetic and real data, we show increased robustness to strong noise levels compared to various state-of-the-art methods, enabling accurate surface reconstruction from extremely noisy real data obtained by range scans. Finally, the simplicity and universality of our approach make it easy to integrate into any existing geometry processing pipeline.
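Conceptually, the two-stage pipeline can be sketched as below; the interfaces `outlier_net` and `displacement_net` are placeholders we introduce for illustration, not the actual PCPNet-based components.

```python
# Minimal sketch of the two-stage denoising pipeline (assumed interfaces;
# the real method uses PCPNet-style networks over local point patches).
import numpy as np

def denoise(points, outlier_net, displacement_net, threshold=0.5):
    """points: (N, 3) array; the two nets are assumed to map a point and
    its local neighborhood to an outlier score / a 3D correction vector."""
    # Stage 1: classify and discard outlier samples.
    scores = np.array([outlier_net(p, points) for p in points])
    inliers = points[scores < threshold]
    # Stage 2: project the remaining noisy points back toward the surface.
    corrections = np.array([displacement_net(p, inliers) for p in inliers])
    return inliers + corrections
```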
Accurate segmentation of the prostate from magnetic resonance (MR) images provides useful information for prostate cancer diagnosis and treatment. However, automated prostate segmentation from 3D MR images still faces several challenges. For instance, the lack of a clear edge between the prostate and other anatomical structures makes it difficult to accurately extract the boundaries. The complex background texture and large variations in the size, shape, and intensity distribution of the prostate itself complicate segmentation further. With deep learning, especially convolutional neural networks (CNNs), emerging as the commonly used method for medical image segmentation, the difficulty of obtaining large numbers of annotated medical images for training CNNs has become more pronounced than ever before. Since large-scale datasets are one of the critical components for the success of deep learning, the lack of sufficient training data makes it difficult to fully train complex CNNs. To tackle the above challenges, in this paper we propose a boundary-weighted domain adaptive neural network (BOWDA-Net). To make the network more sensitive to boundaries during segmentation, a boundary-weighted segmentation loss (BWL) is proposed. Furthermore, an advanced boundary-weighted transfer learning approach is introduced to address the problem of small medical imaging datasets. We evaluate the proposed model on the publicly available MICCAI 2012 Prostate MR Image Segmentation (PROMISE12) challenge dataset. Our experimental results demonstrate that the proposed model is more sensitive to boundary information and outperforms other state-of-the-art methods.
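One plausible reading of a boundary-weighted segmentation loss is a per-voxel cross-entropy whose weights grow near the ground-truth boundary; the sketch below is our construction under that assumption, not the exact BWL formulation from the paper.

```python
# Minimal sketch of a boundary-weighted segmentation loss in the spirit of
# BWL (our construction; the exact weighting function is an assumption).
import torch
import torch.nn.functional as F

def boundary_weighted_loss(logits, target, boundary_dist, sigma=3.0, w_max=5.0):
    """logits: (B, C, ...) raw scores; target: (B, ...) integer labels;
    boundary_dist: (B, ...) distance of each voxel to the ground-truth
    boundary (precomputed, e.g., with a distance transform)."""
    # Voxels near the boundary get up to w_max times more weight.
    weights = 1.0 + (w_max - 1.0) * torch.exp(-boundary_dist**2 / (2 * sigma**2))
    ce = F.cross_entropy(logits, target, reduction="none")
    return (weights * ce).mean()
```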
The massive availability of cameras and personal devices results in wide variability between imaging conditions, producing large intra-class variations and a performance drop when such images are compared for person recognition. However, as biometric solutions are extensively deployed, it will be common to replace acquisition hardware as it is damaged or newer designs appear, or to exchange information between agencies or applications in heterogeneous environments. Furthermore, variations in imaging bands can also occur. For example, faces are typically acquired in the visible (VW) spectrum, while iris images are captured in the near-infrared (NIR) spectrum. However, cross-spectrum comparison may be needed if, for example, a face from a surveillance camera needs to be compared against a legacy iris database. Here, we propose a multialgorithmic approach to cope with cross-sensor periocular recognition. We integrate different systems using a fusion scheme based on linear logistic regression, in which fused scores tend to be log-likelihood ratios. This allows easy combination by simply summing the scores of the available systems. We evaluate our approach in the context of the 1st Cross-Spectral Iris/Periocular Competition, whose aim was to compare person recognition approaches when periocular data from VW and NIR images are matched. The proposed fusion approach achieves reductions in error rates of up to 20-30% in cross-spectral NIR-VW comparison, leading to an EER of 0.22% and an FRR of just 0.62% at FAR=0.01%, representing the best overall approach of the mentioned competition. Experiments are also reported with a database of VW images from two different smartphones, achieving even higher relative improvements in performance. We also discuss our approach from the point of view of template size and computation times, with the most computationally heavy system playing an important role in the results.
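The fusion step can be sketched as follows: a linear logistic regression learns per-system weights on development data, after which the fused, calibrated scores behave like log-likelihood ratios and can simply be summed over whichever systems are available. The code below is illustrative only; the scores and labels are placeholders.

```python
# Minimal sketch of score fusion via linear logistic regression
# (illustrative; system scores and labels are placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion(scores, labels):
    """scores: (N, K) matrix of scores from K systems for N comparisons;
    labels: 1 = genuine, 0 = impostor. Returns fusion weights and bias."""
    clf = LogisticRegression()
    clf.fit(scores, labels)
    return clf.coef_.ravel(), clf.intercept_[0]

def fuse(scores, weights, bias):
    # The fused score a0 + sum_k a_k * s_k tends to a log-likelihood ratio,
    # so combining available systems reduces to summing calibrated scores.
    return scores @ weights + bias
```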
This paper presents an efficient approach for jointly solving facade registration and semantic segmentation. Progress in facade detection and recognition enables a good initialization for the registration of a reference facade to a newly acquired target image. We propose here to rely on semantic segmentation to improve the accuracy of that initial registration. Simultaneously, we aim to improve the quality of the semantic segmentation through the registration. These two problems are jointly solved in an Expectation-Maximization framework. In particular, we introduce a Bayesian model that uses the prior semantic segmentation as well as the geometric structure of the reference facade, modeled by $L_p$ Gaussian mixtures. We show the advantages of our method in terms of robustness to clutter and changes of illumination on urban images from various databases.
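Schematically, and only as our reading of the abstract, the alternation can be written as an E-step $\gamma_{nk} \propto \pi_k \exp\big(-\|\Sigma_k^{-1/2}(T_\theta(x_n)-\mu_k)\|_p^p\big)$, which computes the responsibilities of the $L_p$ Gaussian mixture components for the transformed observations, followed by an M-step $\theta \leftarrow \arg\max_\theta \sum_{n,k} \gamma_{nk} \log p\big(T_\theta(x_n) \mid k\big)$ that updates the registration $T_\theta$; the paper's full Bayesian model additionally incorporates the semantic segmentation prior.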
Presently, the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we conceptually explore how loss geometry interacts with training procedures. The consistency loss dramatically improves generalization performance over supervised-only training; however, we show that SGD struggles to converge on the consistency loss and continues to make large steps that lead to changes in predictions on the test data. Motivated by these observations, we propose to train consistency-based methods with Stochastic Weight Averaging (SWA), a recent approach that averages weights along the trajectory of SGD with a modified learning rate schedule. We also propose fast-SWA, which further accelerates convergence by averaging multiple points within each cycle of a cyclical learning rate schedule. With weight averaging, we achieve the best known semi-supervised results on CIFAR-10 and CIFAR-100 across many different quantities of labeled training data. For example, we achieve 5.0% error on CIFAR-10 with only 4000 labels, compared to the previous best result in the literature of 6.3%.
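A minimal sketch of fast-SWA as described above: maintain a running average of the weights and fold the current weights in at several points inside each cycle of the cyclical learning-rate schedule. The schedule shape, cycle length, and hyperparameters below are assumptions for illustration.

```python
# Minimal sketch of fast-SWA (assumed schedule and hyperparameters; the
# averaged model's BatchNorm statistics would need recomputation before use).
import copy
import torch

def fast_swa_train(model, loader, loss_fn, epochs=100, cycle=4):
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    swa_model = copy.deepcopy(model)
    n_avg = 0
    steps_per_epoch = len(loader)
    for epoch in range(epochs):
        for i, (x, y) in enumerate(loader):
            # Cyclical schedule: lr decays linearly within each cycle.
            t = ((epoch % cycle) + i / steps_per_epoch) / cycle
            for g in opt.param_groups:
                g["lr"] = 0.1 * (1 - 0.9 * t)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        # fast-SWA: fold the weights into the running average at several
        # epochs per cycle, not only at the cycle's end point.
        n_avg += 1
        for p_swa, p in zip(swa_model.parameters(), model.parameters()):
            p_swa.data.mul_((n_avg - 1) / n_avg).add_(p.data / n_avg)
    return swa_model
```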
In this case study, we design, integrate, and implement a cloud-enabled autonomous robotic navigation system. The system has the following features: map generation and robot coordination via a cloud service, and video streaming to allow online monitoring and control in case of emergency. The system has been tested by generating a map of a long corridor in two modes, manual and autonomous; the autonomous mode produced a more accurate map. In addition, the field experiments confirm the benefit of offloading the heavy computation to the cloud, which significantly shortens the time required to build the map.
Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of object-level visual features. In this paper, we propose a novel and efficient deep framework that boosts multi-label classification by distilling knowledge from a weakly-supervised detection task, without bounding box annotations. Specifically, given only image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module that guides the classification model with the WSD model, according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model serves as the teacher and the classification model as the student. After this cross-task knowledge distillation, the performance of the classification model is significantly improved, and efficiency is maintained since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework outperforms state-of-the-art methods in both accuracy and efficiency.
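The distillation objective suggested by the abstract combines a supervised multi-label loss with terms matching the student's class-level predictions and object-level features to the WSD teacher's; the sketch below is our construction, with the temperature and weighting terms assumed.

```python
# Minimal sketch of a cross-task distillation loss (our construction from
# the abstract; T, alpha, and beta are assumed hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_feats,
                      teacher_feats, labels, T=2.0, alpha=0.5, beta=0.1):
    # Supervised multi-label loss on the ground-truth image-level labels.
    sup = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Match the student's class-level predictions to the WSD teacher's.
    soft = F.binary_cross_entropy_with_logits(student_logits / T,
                                              torch.sigmoid(teacher_logits / T))
    # Match object-level (RoI) features between student and teacher.
    feat = F.mse_loss(student_feats, teacher_feats)
    return sup + alpha * soft + beta * feat
```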
SLAM technology has recently seen many successes and attracted the attention of high-tech companies. However, how to unify the interfaces of existing or emerging algorithms, and how to effectively benchmark their speed, robustness, and portability, remain open problems. In this paper, we propose a novel SLAM platform named GSLAM, which not only provides evaluation functionality but also supplies a useful toolkit for researchers to quickly develop their own SLAM systems. The core contribution of GSLAM is a universal, cross-platform, and fully open-source SLAM interface for both research and commercial usage, which aims to handle interactions with input datasets, SLAM implementations, visualization, and applications in a unified framework. Through this platform, users can implement their own functions as plugins for better performance and further push SLAM applications toward practical usage.
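The interface idea can be illustrated with a small plugin registry, sketched in Python for brevity (GSLAM itself is a C++ framework, and these class and method names are hypothetical):

```python
# Illustrative sketch of a unified plugin-style SLAM interface: every
# implementation exposes the same entry points and is loaded by name.
from abc import ABC, abstractmethod

class SLAMPlugin(ABC):
    """Common interface: datasets feed frames in, poses and maps come out."""

    @abstractmethod
    def track(self, frame):
        """Process one input frame; return the estimated pose."""

    @abstractmethod
    def get_map(self):
        """Return the current map for visualization or evaluation."""

REGISTRY = {}

def register(name):
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

def load_slam(name, **kwargs):
    # Benchmarks and applications stay unchanged while implementations swap.
    return REGISTRY[name](**kwargs)
```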