** We present a method for gating deep-learning architectures on a fine-grained level. Individual convolutional maps are turned on/off conditionally on features in the network. This method allows us to train neural networks with a large capacity, but lower inference time than the full network. To achieve this, we introduce a new residual block architecture that gates convolutional channels in a fine-grained manner. We also introduce a generally applicable tool "batch-shaping" that matches the marginal aggregate posteriors of features in a neural network to a pre-specified prior distribution. We use this novel technique to force gates to be more conditional on the data. We present results on CIFAR-10 and ImageNet datasets for image classification and Cityscapes for semantic segmentation. Our results show that our method can slim down large architectures conditionally, such that the average computational cost on the data is on par with a smaller architecture, but with higher accuracy. In particular, our ResNet34 gated network achieves a performance of 72.55% top-1 accuracy compared to the 69.76% accuracy of the baseline ResNet18 model, for similar complexity. We also show that the resulting networks automatically learn to use more features for difficult examples and fewer features for simple examples. **

** Falling in Parkinsonian syndromes (PS) is associated with postural instability and consists a common cause of disability among PS patients. Current posturographic practices record the body's center-of-pressure displacement (statokinesigram) while the patient stands on a force platform. Statokinesigrams, after appropriate signal processing, can offer numerous posturographic features, which however challenges the efforts for valid statistics via standard univariate approaches. In this work, we present the ts-AUC, a non-parametric multivariate two-sample test, which we employ to analyze statokinesigram differences among PS patients that are fallers (PSf) and non-fallers (PSNF). We included 123 PS patients who were classified into PSF or PSNF based on clinical assessment and underwent simple Romberg Test (eyes open/eyes closed). We analyzed posturographic features using both multiple testing with p-value adjustment and the ts-AUC. While the ts-AUC showed significant difference between groups (p-value = 0.01), multiple testing did not show any such difference. Interestingly, significant difference between the two groups was found only using the open-eyes protocol. PSF showed significantly increased antero-posterior movements as well as increased posturographic area, compared to PSNF. Our study demonstrates the superiority of the ts-AUC test compared to standard statistical tools in distinguishing PSF and PSNF in the multidimensional feature space. This result highlights more generally the fact that machine learning-based statistical tests can be seen as a natural extension of classical statistical approaches and should be considered, especially when dealing with multifactorial assessments. **

** Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have utilized full attention -- which operates on all pairs of sequence elements -- the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model that has only linear requirements in memory and computation time. We show that, despite the simpler attention model, networks using this attention mechanism can attain comparable performance to full attention networks on language modeling tasks. **

** Risk adjustment has become an increasingly important tool in healthcare. It has been extensively applied to payment adjustment for health plans to reflect the expected cost of providing coverage for members. Risk adjustment models are typically estimated using linear regression, which does not fully exploit the information in claims data. Moreover, the development of such linear regression models requires substantial domain expert knowledge and computational effort for data preprocessing. In this paper, we propose a novel approach for risk adjustment that uses semantic embeddings to represent patient medical histories. Embeddings efficiently represent medical concepts learned from diagnostic, procedure, and prescription codes in patients' medical histories. This approach substantially reduces the need for feature engineering. Our results show that models using embeddings had better performance than a commercial risk adjustment model on the task of prospective risk score prediction. **

** Complete characterization of states and processes that occur within quantum devices is crucial for understanding and testing their potential to outperform classical technologies for communications and computing. However, this task becomes unwieldy for large and complex quantum systems. Here we realize and experimentally demonstrate a method for complete characterization of a harmonic oscillator based on an artificial neural network known as the restricted Boltzmann machine. We apply the method to experimental balanced homodyne tomography and show it to allow full estimation of quantum states based on a smaller amount of experimental data. Although our experiment is in the optical domain, our method provides a way of exploring quantum resources in a broad class of physical systems, such as superconducting circuits, atomic and molecular ensembles, and optomechanical systems. **

** We present a technique to perform dimensionality reduction on data that is subject to uncertainty. Our method is a generalization of traditional principal component analysis (PCA) to multivariate probability distributions. In comparison to non-linear methods, linear dimensionality reduction techniques have the advantage that the characteristics of such probability distributions remain intact after projection. We derive a representation of the PCA sample covariance matrix that respects potential uncertainty in each of the inputs, building the mathematical foundation of our new method uncertainty-aware PCA. In addition to the accuracy and performance gained by our approach over sampling-based strategies, our formulation allows us to perform sensitivity analysis with regard to the uncertainty in the data. For this, we propose factor traces as a novel visualization that enables to better understand the influence of uncertainty on the chosen principal components. We provide multiple examples of our technique using real-world datasets. As a special case, we show how to propagate multivariate normal distributions through PCA in closed-form. Furthermore, we discuss extensions and limitations of our approach. **

** Generative models of natural images have progressed towards high fidelity samples by the strong leveraging of scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity than previous work. Our proposed network, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, and achieve new state of the art Frechet Inception Distance on prediction for Kinetics-600, as well as state of the art Inception Score for synthesis on the UCF-101 dataset, alongside establishing a number of strong baselines on Kinetics-600. **

** Optimizing nonlinear systems involving expensive computer experiments with regard to conflicting objectives is a common challenge. When the number of experiments is severely restricted and/or when the number of objectives increases, uncovering the whole set of Pareto optimal solutions is out of reach, even for surrogate-based approaches: the proposed solutions are sub-optimal or do not cover the front well. As non-compromising optimal solutions have usually little point in applications, this work restricts the search to solutions that are close to the Pareto front center. The article starts by characterizing this center, which is defined for any type of front. Next, a Bayesian multi-objective optimization method for directing the search towards it is proposed. Targeting a subset of the Pareto front allows an improved optimality of the solutions and a better coverage of this zone, which is our main concern. A criterion for detecting convergence to the center is described. If the criterion is triggered, a widened central part of the Pareto front is targeted such that sufficiently accurate convergence to it is forecasted within the remaining budget. Numerical experiments show how the resulting algorithm, C-EHI, better locates the central part of the Pareto front when compared to state-of-the-art Bayesian algorithms. **

** Recently deep learning-based methods have been applied in image compression and achieved many promising results. In this paper, we propose an improved hybrid layered image compression framework by combining deep learning and the traditional image codecs. At the encoder, we first use a convolutional neural network (CNN) to obtain a compact representation of the input image, which is losslessly encoded by the FLIF codec as the base layer of the bit stream. A coarse reconstruction of the input is obtained by another CNN from the reconstructed compact representation. The residual between the input and the coarse reconstruction is then obtained and encoded by the H.265/HEVC-based BPG codec as the enhancement layer of the bit stream. Experimental results using the Kodak and Tecnick datasets show that the proposed scheme outperforms the state-of-the-art deep learning-based layered coding scheme and traditional codecs including BPG in both PSNR and MS-SSIM metrics across a wide range of bit rates, when the images are coded in the RGB444 domain. **

** We provide recovery guarantees for compressible signals that have been corrupted with noise and extend the framework introduced in [1] to defend neural networks against $\ell_0$-norm and $\ell_2$-norm attacks. Concretely, for a signal that is approximately sparse in some transform domain and has been perturbed with noise, we provide guarantees for accurately recovering the signal in the transform domain. We can then use the recovered signal to reconstruct the signal in its original domain while largely removing the noise. Our results are general as they can be directly applied to most unitary transforms used in practice and hold for both $\ell_0$-norm bounded noise and $\ell_2$-norm bounded noise. In the case of $\ell_0$-norm bounded noise, we prove recovery guarantees for Iterative Hard Thresholding (IHT) and Basis Pursuit (BP). For the case of $\ell_2$-norm bounded noise, we provide recovery guarantees for BP. These guarantees theoretically bolster the defense framework introduced in [1] for defending neural networks against adversarial inputs. Finally, we experimentally demonstrate this defense framework using both IHT and BP against the One Pixel Attack [21], Carlini-Wagner $\ell_0$ and $\ell_2$ attacks [3], Jacobian Saliency Based attack [18], and the DeepFool attack [17] on CIFAR-10 [12], MNIST [13], and Fashion-MNIST [27] datasets. This expands beyond the experimental demonstrations of [1]. **

** One of the challenges in display advertising is that the distribution of features and click through rate (CTR) can exhibit large shifts over time due to seasonality, changes to ad campaigns and other factors. The predominant strategy to keep up with these shifts is to train predictive models continuously, on fresh data, in order to prevent them from becoming stale. However, in many ad systems positive labels are only observed after a possibly long and random delay. These delayed labels pose a challenge to data freshness in continuous training: fresh data may not have complete label information at the time they are ingested by the training algorithm. Naive strategies which consider any data point a negative example until a positive label becomes available tend to underestimate CTR, resulting in inferior user experience and suboptimal performance for advertisers. The focus of this paper is to identify the best combination of loss functions and models that enable large-scale learning from a continuous stream of data in the presence of delayed labels. In this work, we compare 5 different loss functions, 3 of them applied to this problem for the first time. We benchmark their performance in offline settings on both public and proprietary datasets in conjunction with shallow and deep model architectures. We also discuss the engineering cost associated with implementing each loss function in a production environment. Finally, we carried out online experiments with the top performing methods, in order to validate their performance in a continuous training scheme. While training on 668 million in-house data points offline, our proposed methods outperform previous state-of-the-art by 3% relative cross entropy (RCE). During online experiments, we observed 55% gain in revenue per thousand requests (RPMq) against naive log loss. **

** Deep learning techniques are increasingly being considered for geological applications where -- much like in computer vision -- the challenges are characterized by high-dimensional spatial data dominated by multipoint statistics. In particular, a novel technique called generative adversarial networks has been recently studied for geological parametrization and synthesis, obtaining very impressive results that are at least qualitatively competitive with previous methods. The method obtains a neural network parametrization of the geology -- so-called a generator -- that is capable of reproducing very complex geological patterns with dimensionality reduction of several orders of magnitude. Subsequent works have addressed the conditioning task, i.e. using the generator to generate realizations honoring spatial observations (hard data). The current approaches, however, do not provide a parametrization of the conditional generation process. In this work, we propose a method to obtain a parametrization for direct generation of conditional realizations. The main idea is to simply extend the existing generator network by stacking a second inference network that learns to perform the conditioning. This inference network is a neural network trained to sample a posterior distribution derived using a Bayesian formulation of the conditioning task. The resulting extended neural network thus provides the conditional parametrization. Our method is assessed on a benchmark image of binary channelized subsurface, obtaining very promising results for a wide variety of conditioning configurations. **

** In contextual continuum-armed bandits, the contexts $x$ and the arms $y$ are both continuous and drawn from high-dimensional spaces. The payoff function to learn $f(x,y)$ does not have a particular parametric form. The literature has shown that for Lipschitz-continuous functions, the optimal regret is $\tilde{O}(T^{\frac{d_x+d_y+1}{d_x+d_y+2}})$, where $d_x$ and $d_y$ are the dimensions of contexts and arms, and thus suffers from the curse of dimensionality. We develop an algorithm that achieves regret $\tilde{O}(T^{\frac{d_x+1}{d_x+2}})$ when $f$ is globally concave in $y$. The global concavity is a common assumption in many applications. The algorithm is based on stochastic approximation and estimates the gradient information in an online fashion. Our results generate a valuable insight that the curse of dimensionality of the arms can be overcome with some mild structures of the payoff function. **

** Intra-class compactness and inter-class separability are crucial indicators to measure the effectiveness of a model to produce discriminative features, where intra-class compactness indicates how close the features with the same label are to each other and inter-class separability indicates how far away the features with different labels are. In this work, we investigate intra-class compactness and inter-class separability of features learned by convolutional networks and propose a Gaussian-based softmax ($\mathcal{G}$-softmax) function that can effectively improve intra-class compactness and inter-class separability. The proposed function is simple to implement and can easily replace the softmax function. We evaluate the proposed $\mathcal{G}$-softmax function on classification datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) and on multi-label classification datasets (i.e., MS COCO and NUS-WIDE). The experimental results show that the proposed $\mathcal{G}$-softmax function improves the state-of-the-art models across all evaluated datasets. In addition, analysis of the intra-class compactness and inter-class separability demonstrates the advantages of the proposed function over the softmax function, which is consistent with the performance improvement. More importantly, we observe that high intra-class compactness and inter-class separability are linearly correlated to average precision on MS COCO and NUS-WIDE. This implies that improvement of intra-class compactness and inter-class separability would lead to improvement of average precision. **

** Deep learning has thrived by training on large-scale datasets. However, for continual learning in applications such as robotics, it is critical to incrementally update its model in a sample efficient manner. We propose a novel method that constructs the new class weights from few labelled samples in the support set without back-propagation, relying on our adaptive masked proxies approach. It utilizes multi-resolution average pooling on the output embeddings masked with the label to act as a positive proxy for the new class, while fusing it with the previously learned class signatures. Our proposed method is evaluated on PASCAL-$5^i$ dataset and outperforms the state of the art in the 5-shot semantic segmentation. Unlike previous methods, our proposed approach does not require a second branch to estimate parameters or prototypes, which enables it to be used with 2-stream motion and appearance based segmentation networks. The proposed adaptive proxies allow the method to be used with a continuous data stream. Our online adaptation scheme is evaluated on the DAVIS and FBMS video object segmentation benchmark. We further propose a novel setup for evaluating continual learning of object segmentation which we name incremental PASCAL (iPASCAL) where our method has shown to outperform the baseline method. Code is publicly available at https://github.com/MSiam/AdaptiveMaskedProxies. **