** Datasets often contain biases which unfairly disadvantage certain groups, and classifiers trained on such datasets can inherit these biases. In this paper, we provide a mathematical formulation of how this bias can arise. We do so by assuming the existence of underlying, unknown, and unbiased labels which are overwritten by an agent who intends to provide accurate labels but may have biases against certain groups. Despite the fact that we only observe the biased labels, we are able to show that the bias may nevertheless be corrected by re-weighting the data points without changing the labels. We show, with theoretical guarantees, that training on the re-weighted dataset corresponds to training on the unobserved but unbiased labels, thus leading to an unbiased machine learning classifier. Our procedure is fast and robust and can be used with virtually any learning algorithm. We evaluate on a number of standard machine learning fairness datasets and a variety of fairness notions, finding that our method outperforms standard approaches in achieving fair classification. **

** We focus on the problem of learning a single motor module that can flexibly express a range of behaviors for the control of high-dimensional physically simulated humanoids. To do this, we propose a motor architecture that has the general structure of an inverse model with a latent-variable bottleneck. We show that it is possible to train this model entirely offline to compress thousands of expert policies and learn a motor primitive embedding space. The trained neural probabilistic motor primitive system can perform one-shot imitation of whole-body humanoid behaviors, robustly mimicking unseen trajectories. Additionally, we demonstrate that it is also straightforward to train controllers to reuse the learned motor primitive space to solve tasks, and the resulting movements are relatively naturalistic. To support the training of our model, we compare two approaches for offline policy cloning, including an experience efficient method which we call linear feedback policy cloning. We encourage readers to view a supplementary video ( https://youtu.be/CaDEf-QcKwA ) summarizing our results. **

** Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface. However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem. **

** Although simple individually, artificial neurons provide state-of-the-art performance when interconnected in deep networks. Unknown to many, there exists an arguably even simpler and more versatile learning mechanism, namely, the Tsetlin Automaton. Merely by means of a single integer as memory, it learns the optimal action in stochastic environments through increment and decrement operations. In this paper, we introduce the Tsetlin Machine, which solves complex pattern recognition problems with easy-to-interpret propositional formulas, composed by a collective of Tsetlin Automata. To eliminate the longstanding problem of vanishing signal-to-noise ratio, the Tsetlin Machine orchestrates the automata using a novel game. Our theoretical analysis establishes that the Nash equilibria of the game align with the propositional formulas that provide optimal pattern recognition accuracy. This translates to learning without local optima, only global ones. We argue that the Tsetlin Machine finds the propositional formula that provides optimal accuracy, with probability arbitrarily close to unity. In five benchmarks, the Tsetlin Machine provides competitive accuracy compared with SVMs, Decision Trees, Random Forests, Naive Bayes Classifier, Logistic Regression, and Neural Networks. The Tsetlin Machine further has an inherent computational advantage since both inputs, patterns, and outputs are expressed as bits, while recognition and learning rely on bit manipulation. The combination of accuracy, interpretability, and computational simplicity makes the Tsetlin Machine a promising tool for a wide range of domains. Being the first of its kind, we believe the Tsetlin Machine will kick-start new paths of research, with a potentially significant impact on the AI field and the applications of AI. **

** Gaussian processes (GPs) provide a powerful non-parametric framework for reasoning over functions. Despite appealing theory, its superlinear computational and memory complexities have presented a long-standing challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments. **

** The multi-armed bandit (MAB) model has been widely adopted for studying many practical optimization problems (resource allocation, ad placement, crowdsourcing, etc.) with unknown parameters. The goal of the player here is to maximize the cumulative reward in the face of uncertainty. However, the basic MAB model neglects several important factors of the system in many real-world applications, where multiple arms can be simultaneously played and an arm could sometimes be "sleeping". Besides, ensuring fairness is also a key design concern in practice. To that end, we propose a new Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F, aiming to address the aforementioned crucial modeling issues. The objective is now to maximize the reward while satisfying the fairness requirement of a minimum selection fraction for each individual arm. To tackle this new problem, we extend an online learning algorithm, called UCB, to deal with a critical tradeoff between exploitation and exploration and employ the virtual queue technique to properly handle the fairness constraints. By carefully integrating these two techniques, we develop a new algorithm, called Learning with Fairness Guarantee (LFG), for the CSMAB-F problem. Further, we rigorously prove that not only LFG is feasibility-optimal, but it also has a time-average regret upper bounded by \frac{N}{2\eta}+\frac{\beta_1\sqrt{mNT\log{T}}+\beta_2 N}{T}, where N is the total number of arms, m is the maximum number of arms that can be simultaneously played, T is the time horizon, \beta_1 and \beta_2 are constants, and \eta is a design parameter that we can tune. Finally, we perform extensive simulations to corroborate the effectiveness of the proposed algorithm. Interestingly, the simulation results reveal an important tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints. **

** Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Although there is no consensus on a definition, human emotional states usually can be apperceived by auditory and visual systems. Inspired by this cognitive process in human beings, it's natural to simultaneously utilize audio and visual information in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which hardly captures complex association between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the features of audio and video. Specifically, the features are selected through the embedded attention mechanism from respective modalities to obtain the emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result. **

** We address the problem of optimizing a Brownian motion. We consider a (random) realization $W$ of a Brownian motion with input space in $[0,1]$. Given $W$, our goal is to return an $\epsilon$-approximation of its maximum using the smallest possible number of function evaluations, the sample complexity of the algorithm. We provide an algorithm with sample complexity of order $\log^2(1/\epsilon)$. This improves over previous results of Al-Mharmah and Calvin (1996) and Calvin et al. (2017) which provided only polynomial rates. Our algorithm is adaptive---each query depends on previous values---and is an instance of the optimism-in-the-face-of-uncertainty principle. **

** We present a probabilistic deep learning methodology that enables the construction of predictive data-driven surrogates for stochastic systems. Leveraging recent advances in variational inference with implicit distributions, we put forth a statistical inference framework that enables the end-to-end training of surrogate models on paired input-output observations that may be stochastic in nature, originate from different information sources of variable fidelity, or be corrupted by complex noise processes. The resulting surrogates can accommodate high-dimensional inputs and outputs and are able to return predictions with quantified uncertainty. The effectiveness our approach is demonstrated through a series of canonical studies, including the regression of noisy data, multi-fidelity modeling of stochastic processes, and uncertainty propagation in high-dimensional dynamical systems. **

** Deep latent variable models have seen recent success in many data domains. Lossless compression is an application of these models which, despite having the potential to be highly useful, has yet to be implemented in a practical manner. We present `Bits Back with ANS' (BB-ANS), a scheme to perform lossless compression with latent variable models at a near optimal rate. We demonstrate this scheme by using it to compress the MNIST dataset with a variational auto-encoder model (VAE), achieving compression rates superior to standard methods with only a simple VAE. Given that the scheme is highly amenable to parallelization, we conclude that with a sufficiently high quality generative model this scheme could be used to achieve substantial improvements in compression rate with acceptable running time. We make our implementation available open source at https://github.com/bits-back/bits-back . **

** Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference. In this paper, we describe a simple re-implementation of BERT for query-based passage re-ranking. Our system is the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in [email protected] The code to reproduce our submission is available at https://github.com/nyu-dl/dl4marco-bert **

** Adding inequality constraints (e.g. boundedness, monotonicity, convexity) into Gaussian processes (GPs) can lead to more realistic stochastic emulators. Due to the truncated Gaussianity of the posterior, its distribution has to be approximated. In this work, we consider Monte Carlo (MC) and Markov chain Monte Carlo (MCMC). However, strictly interpolating the observations may entail expensive computations due to highly restrictive sample spaces. Having (constrained) GP emulators when data are actually noisy is also of interest. We introduce a noise term for the relaxation of the interpolation conditions, and we develop the corresponding approximation of GP emulators under linear inequality constraints. We show with various toy examples that the performance of MC and MCMC samplers improves when considering noisy observations. Finally, on a 5D monotonic example, we show that our framework still provides high effective sample rates with reasonable running times. **

** Linear problems appear in a variety of disciplines and their application for the transmission matrix recovery is one of the most stimulating challenges in biomedical imaging. Its knowledge turns any random media into an optical tool that can focus or transmit an image through disorder. Here, converting an input-output problem into a statistical mechanical formulation, we investigate how inference protocols can learn the transmission couplings by pseudolikelihood maximization. Bridging linear regression and thermodynamics let us propose an innovative framework to pursue the solution of the scattering-riddle. **

** Bayesian optimal design of experiments (BODE) has been successful in acquiring information about a quantity of interest (QoI) which depends on a black-box function. BODE is characterized by sequentially querying the function at specific designs selected by an infill-sampling criterion. However, most current BODE methods operate in specific contexts like optimization, or learning a universal representation of the black-box function. The objective of this paper is to design a BODE for estimating the statistical expectation of a physical response surface. This QoI is omnipresent in uncertainty propagation and design under uncertainty problems. Our hypothesis is that an optimal BODE should be maximizing the expected information gain in the QoI. We represent the information gain from a hypothetical experiment as the Kullback-Liebler (KL) divergence between the prior and the posterior probability distributions of the QoI. The prior distribution of the QoI is conditioned on the observed data and the posterior distribution of the QoI is conditioned on the observed data and a hypothetical experiment. The main contribution of this paper is the derivation of a semi-analytic mathematical formula for the expected information gain about the statistical expectation of a physical response. The developed BODE is validated on synthetic functions with varying number of input-dimensions. We demonstrate the performance of the methodology on a steel wire manufacturing problem. **

** The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence DKL between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed form DKL in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable DKL with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a-posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain: applicability on non-conjugate models, posterior correlations and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors. **