Standard practice in training neural networks involves initializing the weights in an independent fashion. The results of recent work suggest that feature "diversity" at initialization plays an important role in training the network. However, other initialization schemes with reduced feature diversity have also been shown to be viable. In this work, we conduct a series of experiments aimed at elucidating the importance of feature diversity at initialization. We show that a complete lack of diversity is harmful to training, but its effects can be counteracted by a relatively small addition of noise - even the noise in standard non-deterministic GPU computations is sufficient. Furthermore, we construct a deep convolutional network with identical features at initialization and almost all of the weights initialized at 0 that can be trained to reach accuracy matching its standard-initialized counterpart.
This paper tackles a 2D architecture vectorization problem, whose task is to infer an outdoor building architecture as a 2D planar graph from a single RGB image. We provide a new benchmark with ground-truth annotations for 2,001 complex buildings across the cities of Atlanta, Paris, and Las Vegas. We also propose a novel algorithm utilizing 1) convolutional neural networks (CNNs) that detects geometric primitives and classifies their relationships and 2) an integer programming (IP) that assembles the information into a 2D planar graph. While being a trivial task for human vision, the inference of a graph structure with an arbitrary topology is still an open problem for computer vision. Qualitative and quantitative evaluations demonstrate that our algorithm makes significant improvements over the current state-of-the-art, towards an intelligent system at the level of human perception. We will share code and data to promote further research.
Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.
Modern machine learning frameworks are complex: they are typically organised in multiple layers each of which is written in a different language and they depend on a number of external libraries, but at their core they mainly consist of tensor operations. As array-oriented languages provide perfect abstractions to implement tensor operations, we consider a minimalistic machine learning framework that is shallowly embedded in an array-oriented language and we study its productivity and performance. We do this by implementing a state of the art Convolutional Neural Network (CNN) and compare it against implementations in TensorFlow and PyTorch --- two state of the art industrial-strength frameworks. It turns out that our implementation is 2 and 3 times faster, even after fine-tuning the TensorFlow and PyTorch to our hardware --- a 64-core GPU-accelerated machine. The size of all three CNN specifications is the same, about 150 lines of code. Our mini framework is 150 lines of highly reusable hardware-agnostic code that does not depend on external libraries. The compiler for a host array language automatically generates parallel code for a chosen architecture. The key to such a balance between performance and portability lies in the design of the array language; in particular, the ability to express rank-polymorphic operations concisely, yet being able to do optimisations across them. This design builds on very few assumptions, and it is readily transferable to other contexts offering a clean approach to high-performance machine learning.
We explore an ensembled $\Sigma$-net for fast parallel MR imaging, including parallel coil networks, which perform implicit coil weighting, and sensitivity networks, involving explicit sensitivity maps. The networks in $\Sigma$-net are trained in a supervised way, including content and GAN losses, and with various ways of data consistency, i.e., proximal mappings, gradient descent and variable splitting. A semi-supervised finetuning scheme allows us to adapt to the k-space data at test time, which, however, decreases the quantitative metrics, although generating the visually most textured and sharp images. For this challenge, we focused on robust and high SSIM scores, which we achieved by ensembling all models to a $\Sigma$-net.
We design a classifier for transactional datasets with application in malware detection. We build the classifier based on the minimum description length (MDL) principle. This involves selecting a model that best compresses the training dataset for each class considering the MDL criterion. To select a model for a dataset, we first use clustering followed by closed frequent pattern mining to extract a subset of closed frequent patterns (CFPs). We show that this method acts as a pattern summarization method to avoid pattern explosion; this is done by giving priority to longer CFPs, and without requiring to extract all CFPs. We then use the MDL criterion to further summarize extracted patterns, and construct a code table of patterns. This code table is considered as the selected model for the compression of the dataset. We evaluate our classifier for the problem of static malware detection in portable executable (PE) files. We consider API calls of PE files as their distinguishing features. The presence-absence of API calls forms a transactional dataset. Using our proposed method, we construct two code tables, one for the benign training dataset, and one for the malware training dataset. Our dataset consists of 19696 benign, and 19696 malware samples, each a binary sequence of size 22761. We compare our classifier with deep neural networks providing us with the state-of-the-art performance. The comparison shows that our classifier performs very close to deep neural networks. We also discuss that our classifier is an interpretable classifier. This provides the motivation to use this type of classifiers where some degree of explanation is required as to why a sample is classified under one class rather than the other class.
Computational imaging systems jointly design computation and hardware to retrieve information which is not traditionally accessible with standard imaging systems. Recently, critical aspects such as experimental design and image priors are optimized through deep neural networks formed by the unrolled iterations of classical physics-based reconstructions (termed physics-based networks). However, for real-world large-scale systems, computing gradients via backpropagation restricts learning due to memory limitations of graphical processing units. In this work, we propose a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable data-driven design for large-scale computational imaging. We demonstrate our methods practicality on two large-scale systems: super-resolution optical microscopy and multi-channel magnetic resonance imaging.
Skin-like tactile sensors provide robots with rich feedback related to the force distribution applied to their soft surface. The complexity of interpreting raw tactile information has driven the use of machine learning algorithms to convert the sensory feedback to the quantities of interest. However, the lack of ground truth sources for the entire contact force distribution has mainly limited these techniques to the sole estimation of the total contact force and the contact center on the sensor's surface. The method presented in this article uses a finite element model to obtain ground truth data for the three-dimensional force distribution. The model is obtained with state-of-the-art material characterization methods and is evaluated in an indentation setup, where it shows high agreement with the measurements retrieved from a commercial force-torque sensor. The proposed technique is applied to a vision-based tactile sensor, which aims to reconstruct the contact force distribution purely from images. Thousands of images are matched to ground truth data and are used to train a neural network architecture, which is suitable for real-time predictions.
Standard practice in training neural networks involves initializing the weights in an independent fashion. The results of recent work suggest that feature "diversity" at initialization plays an important role in training the network. However, other initialization schemes with reduced feature diversity have also been shown to be viable. In this work, we conduct a series of experiments aimed at elucidating the importance of feature diversity at initialization. Experimenting on a shallow network, we show that a complete lack of diversity is harmful to training, but its effect can be counteracted by a relatively small addition of noise. Furthermore, we construct a deep convolutional network with identical features at initialization and almost all of the weights initialized at 0 that can be trained to reach accuracy matching its standard-initialized counterpart.
We present a new approach to integrating deep learning with knowledge-based systems that we believe shows promise. Our approach seeks to emulate reasoning structure, which can be inspected part-way through, rather than simply learning reasoner answers, which is typical in many of the black-box systems currently in use. We demonstrate that this idea is feasible by training a long short-term memory (LSTM) artificial neural network to learn EL+ reasoning patterns with two different data sets. We also show that this trained system is resistant to noise by corrupting a percentage of the test data and comparing the reasoner's and LSTM's predictions on corrupt data with correct answers.
In this paper we introduce Principal Filter Analysis (PFA), an easy to use and effective method for neural network compression. PFA exploits the correlation between filter responses within network layers to recommend a smaller network that maintain as much as possible the accuracy of the full model. We propose two algorithms: the first allows users to target compression to specific network property, such as number of trainable variable (footprint), and produces a compressed model that satisfies the requested property while preserving the maximum amount of spectral energy in the responses of each layer, while the second is a parameter-free heuristic that selects the compression used at each layer by trying to mimic an ideal set of uncorrelated responses. Since PFA compresses networks based on the correlation of their responses we show in our experiments that it gains the additional flexibility of adapting each architecture to a specific domain while compressing. PFA is evaluated against several architectures and datasets, and shows considerable compression rates without compromising accuracy, e.g., for VGG-16 on CIFAR-10, CIFAR-100 and ImageNet, PFA achieves a compression rate of 8x, 3x, and 1.4x with an accuracy gain of 0.4%, 1.4% points, and 2.4% respectively. Our tests show that PFA is competitive with state-of-the-art approaches while removing adoption barriers thanks to its practical implementation, intuitive philosophy and ease of use.
Capsule Networks (CapsNets) are brand-new architectures that have shown ground-breaking results in certain areas of Computer Vision (CV). In 2017, Hinton and his team introduced CapsNets with routing-by-agreement in "Sabour et al" and in a more recent paper "Matrix Capsules with EM Routing" they proposed a more complete architecture with Expectation-Maximization (EM) algorithm. Unlike the traditional convolutional neural networks (CNNs), this architecture is able to preserve the pose of the objects in the picture. Due to this characteristic, it has been able to beat the previous state-of-theart results on the smallNORB dataset, which includes samples with various view points. Also, this architecture is more robust to white box adversarial attacks. However, CapsNets have two major drawbacks. They can't perform as well as CNNs on complex datasets and, they need a huge amount of time for training. We try to mitigate these shortcomings by finding optimum settings of EM routing iterations for training CapsNets. Unlike the past studies, we use un-equal numbers of EM routing iterations for different stages of the CapsNet. For our research, we use three datasets: Yale face dataset, Belgium Traffic Sign dataset, and Fashion-MNIST dataset.
In software engineering, log statement is an important part because programmers can't access to users' program and they can only rely on log message to find the root of bugs. The mechanism of "log level" allows developers and users to specify the appropriate amount of logs to print during the execution of the software. And 26\% of the log statement modification is to modify the level. We tried to use ML method to predict the suitable level of log statement. The specific model is GGNN(gated graph neural network) and we have drawn lessons from Microsoft's research. In this work, we apply Graph Neural Networks to predict the usage of log statement level of some open source java projects from github. Given the good performance of GGNN in this task, we are confident that GGNN is an excellent choice for processing source code. We envision this model can play an important role in applying AI/ML technique for Software Development Life Cycle more broadly.
Most problems in natural language processing can be approximated as inverse problems such as analysis and generation at variety of levels from morphological (e.g., cat+Plural <-> cats) to semantic (e.g., (call + 1 2) <-> "Calculate one plus two."). Although the tasks in both directions are closely related, general approach in the field has been to design separate models specific for each task. However, having one shared model for both tasks, would help the researchers exploit the common knowledge among these problems with reduced time and memory requirements. We investigate a specific class of neural networks, called Invertible Neural Networks (INNs) (Ardizzone et al. 2019) that enable simultaneous optimization in both directions, hence allow addressing of inverse problems via a single model. In this study, we investigate INNs on morphological problems casted as inverse problems. We apply INNs to various morphological tasks with varying ambiguity and show that they provide competitive performance in both directions. We show that they are able to recover the morphological input parameters, i.e., predicting the lemma (e.g., cat) or the morphological tags (e.g., Plural) when run in the reverse direction, without any significant performance drop in the forward direction, i.e., predicting the surface form (e.g., cats).