** Mixture models trained via EM are among the simplest, most widely used and well understood latent variable models in the machine learning literature. Surprisingly, these models have been hardly explored in text generation applications such as machine translation. In principle, they provide a latent variable to control generation and produce a diverse set of hypotheses. In practice, however, mixture models are prone to degeneracies---often only one component gets trained or the latent variable is simply ignored. We find that disabling dropout noise in responsibility computation is critical to successful training. In addition, the design choices of parameterization, prior distribution, hard versus soft EM and online versus offline assignment can dramatically affect model performance. We develop an evaluation protocol to assess both quality and diversity of generations against multiple references, and provide an extensive empirical study of several mixture model variants. Our analysis shows that certain types of mixture models are more robust and offer the best trade-off between translation quality and diversity compared to variational models and diverse decoding approaches. **

** Optimizing deep neural networks is largely thought to be an empirical process, requiring manual tuning of several parameters, such as learning rate, weight decay, and dropout rate. Arguably, the learning rate is the most important of these to tune, and this has gained more attention in recent works. In this paper, we propose a novel method to compute the learning rate for training deep neural networks. We derive a theoretical framework to compute learning rates dynamically, and then show experimental results on standard datasets and architectures to demonstrate the efficacy of our approach. **

** In the first part of the paper, we have studied the computational privacy risks in distributed computing protocols against local or global dynamics eavesdroppers, and proposed a Privacy-Preserving-Summation-Consistent (PPSC) mechanism as a generic privacy encryption subroutine for consensus-based distributed computations. In this part of this paper, we show that the conventional deterministic and random gossip algorithms can be used to realize the PPSC mechanism over a given network. At each time step, a node is selected to interact with one of its neighbors via deterministic or random gossiping. Such node generates a random number as its new state, and sends the subtraction between its current state and that random number to the neighbor; then the neighbor updates its state by adding the received value to its current state. We establish concrete privacy-preservation conditions by proving the impossibility for the reconstruction of the network input from the output of the gossip-based PPSC mechanism against eavesdroppers with full network knowledge, and by showing that the PPSC mechanism can achieve differential privacy at arbitrary privacy levels. The convergence is characterized explicitly and analytically for both deterministic and randomized gossiping, which is essentially achieved in a finite number of steps. Additionally, we illustrate that the proposed algorithms can be easily made adaptive in real-world applications by making realtime trade-offs between resilience against node dropout or communication failure and privacy preservation capabilities. **

** While modern convolutional neural networks achieve outstanding accuracy on many image classification tasks, they are, compared to humans, much more sensitive to image degradation. Here, we describe a variant of Batch Normalization, LocalNorm, that regularizes the normalization layer in the spirit of Dropout while dynamically adapting to the local image intensity and contrast at test-time. We show that the resulting deep neural networks are much more resistant to noise-induced image degradation, improving accuracy by up to three times, while achieving the same or slightly better accuracy on non-degraded classical benchmarks. In computational terms, LocalNorm adds negligible training cost and little or no cost at inference time, and can be applied to already-trained networks in a straightforward manner. **

** In many real-world learning scenarios, features are only acquirable at a cost constrained under a budget. In this paper, we propose a novel approach for cost-sensitive feature acquisition at the prediction-time. The suggested method acquires features incrementally based on a context-aware feature-value function. We formulate the problem in the reinforcement learning paradigm, and introduce a reward function based on the utility of each feature. Specifically, MC dropout sampling is used to measure expected variations of the model uncertainty which is used as a feature-value function. Furthermore, we suggest sharing representations between the class predictor and value function estimator networks. The suggested approach is completely online and is readily applicable to stream learning setups. The solution is evaluated on three different datasets including the well-known MNIST dataset as a benchmark as well as two cost-sensitive datasets: Yahoo Learning to Rank and a dataset in the medical domain for diabetes classification. According to the results, the proposed method is able to efficiently acquire features and make accurate predictions. **

** Dropout regularization of deep neural networks has been a mysterious yet effective tool to prevent overfitting. Explanations for its success range from the prevention of "co-adapted" weights to it being a form of cheap Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e. dropout). We show that multiplicative noise induces structured shrinkage priors on a network's weights. We derive the equivalence through reparametrization properties of scale mixtures and without invoking any approximations. Given the equivalence, we then show that dropout's Monte Carlo training objective approximates marginal MAP estimation. We leverage these insights to propose a novel shrinkage framework for resnets, terming the prior "automatic depth determination" as it is the natural analog of automatic relevance determination for network depth. Lastly, we investigate two inference strategies that improve upon the aforementioned MAP approximation in regression benchmarks. **

** Dropout is a popular regularization technique in neural networks. Yet, the reason for its success is still not fully understood. This paper provides a new interpretation of Dropout from a frame theory perspective. By drawing a connection to recent developments in analog channel coding, we suggest that for a certain family of autoencoders with a linear encoder, the minimizer of an optimization with dropout regularization on the encoder is an equiangular tight frame (ETF). Since this optimization is non-convex, we add another regularization that promotes such structures by minimizing the cross-correlation between filters in the network. We demonstrate its applicability in convolutional and fully connected layers in both feed-forward and recurrent networks. All these results suggest that there is indeed a relationship between dropout and ETF structure of the regularized linear operations. **

** Multi-layer neural networks have lead to remarkable performance on many kinds of benchmark tasks in text, speech and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to overfitting and misspecification. One approach to these estimation and related problems (local minima, colinearity, feature discovery etc.) is called Dropout (Hinton, et al 2012, Baldi et al 2016). The Dropout algorithm removes hidden units according to a Bernoulli random variable with probability $p$ prior to each update, creating random "shocks" to the network that are averaged over updates. In this paper we will show that Dropout is a special case of a more general model published originally in 1990 called the Stochastic Delta Rule, or SDR (Hanson, 1990). SDR redefines each weight in the network as a random variable with mean $\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. Each weight random variable is sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights. Both parameters are updated according to prediction error, thus resulting in weight noise injections that reflect a local history of prediction error and local model averaging. SDR therefore implements a more sensitive local gradient-dependent simulated annealing per weight converging in the limit to a Bayes optimal network. Tests on standard benchmarks (CIFAR) using a modified version of DenseNet shows the SDR outperforms standard Dropout in test error by approx. $17\%$ with DenseNet-BC 250 on CIFAR-100 and approx. $12-14\%$ in smaller networks. We also show that SDR reaches the same accuracy that Dropout attains in 100 epochs in as few as 35 epochs. **

** We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of computer vision tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, and temperature scaling. **

** We explore the use of neural networks trained with dropout in predicting epileptic seizures from electroencephalographic data (scalp EEG). The input to the neural network is a 126 feature vector containing 9 features for each of the 14 EEG channels obtained over 1-second, non-overlapping windows. The models in our experiments achieved high sensitivity and specificity on patient records not used in the training process. This is demonstrated using leave-one-out-cross-validation across patient records, where we hold out one patient's record as the test set and use all other patients' records for training; repeating this procedure for all patients in the database. **