“机器学习是近20多年兴起的一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。机器学习理论主要是设计和分析一些让 可以自动“ 学习”的算法。机器学习算法是一类从数据中自动分析获得规律，并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论，机器学习与统计推断学联系尤为密切，也被称为统计学习理论。算法设计方面，机器学习理论关注可以实现的，行之有效的学习算法。很多 推论问题属于 无程序可循难度，所以部分的机器学习研究是开发容易处理的近似算法。” ——中文维基百科

This paper discusses the problem of causal query in observational data with hidden variables, with the aim of seeking the change of an outcome when "manipulating" a variable while given a set of plausible confounding variables which affect the manipulated variable and the outcome. Such an "experiment on data" to estimate the causal effect of the manipulated variable is useful for validating an experiment design using historical data or for exploring confounders when studying a new relationship. However, existing data-driven methods for causal effect estimation face some major challenges, including poor scalability with high dimensional data, low estimation accuracy due to heuristics used by the global causal structure learning algorithms, and the assumption of causal sufficiency when hidden variables are inevitable in data. In this paper, we develop a theorem for using local search to find a superset of the adjustment (or confounding) variables for causal effect estimation from observational data under a realistic pretreatment assumption. The theorem ensures that the unbiased estimate of causal effect is included in the set of causal effects estimated by the superset of adjustment variables. Based on the developed theorem, we propose a data-driven algorithm for causal query. Experiments show that the proposed algorithm is faster and produces better causal effect estimation than an existing data-driven causal effect estimation method with hidden variables. The causal effects estimated by the proposed algorithm are as accurate as those by the state-of-the-art methods using domain knowledge.

We estimate the general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach by formulating the location dependent influence function that captures the influence of historical events as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier transformed nuclear norm (TNN). We provide theoretical performance guarantees for our approach and present an algorithm to solve the optimization problem. We demonstrate the efficiency of our estimation with numerical simulations.

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian. Our main application is to neural network compression, where we build on the classic Optimal Brain Damage/Surgeon framework. We demonstrate that WoodFisher significantly outperforms popular state-of-the-art methods for one-shot pruning. Further, even when iterative, gradual pruning is considered, our method results in a gain in test accuracy over the state-of-the-art approaches, for pruning popular neural networks (like ResNet-50, MobileNetV1) trained on standard image classification datasets such as ImageNet ILSVRC. We examine how our method can be extended to take into account first-order information, as well as illustrate its ability to automatically set layer-wise pruning thresholds and perform compression in the limited-data regime. The code is available at the following link, https://github.com/IST-DASLab/WoodFisher.

Understanding the space of probability measures on a metric space equipped with a Wasserstein distance is one of the fundamental questions in mathematical analysis. The Wasserstein metric has received a lot of attention in the machine learning community especially for its principled way of comparing distributions. In this work, we use a permutation invariant network to map samples from probability measures into a low-dimensional space such that the Euclidean distance between the encoded samples reflects the Wasserstein distance between probability measures. We show that our network can generalize to correctly compute distances between unseen densities. We also show that these networks can learn the first and the second moments of probability distributions.

Automating molecular design using deep reinforcement learning (RL) has the potential to greatly accelerate the search for novel materials. Despite recent progress on leveraging graph representations to design molecules, such methods are fundamentally limited by the lack of three-dimensional (3D) information. In light of this, we propose a novel actor-critic architecture for 3D molecular design that can generate molecular structures unattainable with previous approaches. This is achieved by exploiting the symmetries of the design process through a rotationally covariant state-action representation based on a spherical harmonics series expansion. We demonstrate the benefits of our approach on several 3D molecular design tasks, where we find that building in such symmetries significantly improves generalization and the quality of generated molecules.

Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance. In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling. Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 -- just 4 labels per class. Since FixMatch bears many similarities to existing SSL methods that achieve worse performance, we carry out an extensive ablation study to tease apart the experimental factors that are most important to FixMatch's success. We make our code available at https://github.com/google-research/fixmatch.

Since the Lipschitz properties of convolutional neural network (CNN) are widely considered to be related to adversarial robustness, we theoretically characterize the $\ell_1$ norm and $\ell_\infty$ norm of 2D multi-channel convolutional layers and provide efficient methods to compute the exact $\ell_1$ norm and $\ell_\infty$ norm. Based on our theorem, we propose a novel regularization method termed norm decay, which can effectively reduce the norms of CNN layers. Experiments show that norm-regularization methods, including norm decay, weight decay, and singular value clipping, can improve generalization of CNNs. However, we are surprised to find that they can slightly hurt adversarial robustness. Furthermore, we compute the norms of layers in the CNNs trained with three different adversarial training frameworks and find that adversarially robust CNNs have comparable or even larger norms than their non-adversarially robust counterparts. Moreover, we prove that under a mild assumption, adversarially robust classifiers can be achieved with neural networks and an adversarially robust neural network can have arbitrarily large Lipschitz constant. For these reasons, enforcing small norms of CNN layers may be neither effective nor necessary in achieving adversarial robustness. Our code is available at https://github.com/youweiliang/norm_robustness.

Least squares Monte Carlo methods are a popular numerical approximation method for solving stochastic control problems. Based on dynamic programming, their key feature is the approximation of the conditional expectation of future rewards by linear least squares regression. Hence, the choice of basis functions is crucial for the accuracy of the method. Earlier work by some of us [Belomestny, Schoenmakers, Spokoiny, Zharkynbay. Commun.~Math.~Sci., 18(1):109-121, 2020] proposes to \emph{reinforce} the basis functions in the case of optimal stopping problems by already computed value functions for later times, thereby considerably improving the accuracy with limited additional computational cost. We extend the reinforced regression method to a general class of stochastic control problems, while considerably improving the method's efficiency, as demonstrated by substantial numerical examples as well as theoretical analysis.

Authentication is the task of confirming the matching relationship between a data instance and a given identity. Typical examples of authentication problems include face recognition and person re-identification. Data-driven authentication could be affected by undesired biases, i.e., the models are often trained in one domain (e.g., for people wearing spring outfits) while applied in other domains (e.g., they change the clothes to summer outfits). Previous works have made efforts to eliminate domain-difference. They typically assume domain annotations are provided, and all the domains share classes. However, for authentication, there could be a large number of domains shared by different identities/classes, and it is impossible to annotate these domains exhaustively. It could make domain-difference challenging to model and eliminate. In this paper, we propose a domain-agnostic method that eliminates domain-difference without domain labels. We alternately perform latent domain discovery and domain-difference elimination until our model no longer detects domain-difference. In our approach, the latent domains are discovered by learning the heterogeneous predictive relationships between inputs and outputs. Then domain-difference is eliminated in both class-dependent and class-independent spaces to improve robustness of elimination. We further extend our method to a meta-learning framework to pursue more thorough domain-difference elimination. Comprehensive empirical evaluation results are provided to demonstrate the effectiveness and superiority of our proposed method.

Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.

Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust, and efficient measures of uncertainty are crucial. In this paper, we propose a novel method for training non-Bayesian NNs to estimate a continuous target as well as its associated evidence in order to learn both aleatoric and epistemic uncertainty. We accomplish this by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. We additionally impose priors during training such that the model is regularized when its predicted evidence is not aligned with the correct output. Our method does not rely on sampling during inference or on out-of-distribution (OOD) examples for training, thus enabling efficient and scalable uncertainty learning. We demonstrate learning well-calibrated measures of uncertainty on various benchmarks, scaling to complex computer vision tasks, as well as robustness to adversarial and OOD test samples.

Despite the success of deep learning in domains such as image, voice, and graphs, there has been little progress in deep representation learning for domains without a known structure between features. For instance, a tabular dataset of different demographic and clinical factors where the feature interactions are not given as a prior. In this paper, we propose Group-Connected Multilayer Perceptron (GMLP) networks to enable deep representation learning in these domains. GMLP is based on the idea of learning expressive feature combinations (groups) and exploiting them to reduce the network complexity by defining local group-wise operations. During the training phase, GMLP learns a sparse feature grouping matrix using temperature annealing softmax with an added entropy loss term to encourage the sparsity. Furthermore, an architecture is suggested which resembles binary trees, where group-wise operations are followed by pooling operations to combine information; reducing the number of groups as the network grows in depth. To evaluate the proposed method, we conducted experiments on different real-world datasets covering various application areas. Additionally, we provide visualizations on MNIST and synthesized data. According to the results, GMLP is able to successfully learn and exploit expressive feature combinations and achieve state-of-the-art classification performance on different datasets.

In this paper we revisit one of the classical statistical problems, the so-called sparse maximum-likelihood (ML) linear regression. As a way of attacking this type of regression, we present a novel CLuP mechanism that to a degree relies on the \bl{\textbf{Random Duality Theory (RDT)}} based algorithmic machinery that we recently introduced in \cite{Stojnicclupint19,Stojnicclupcmpl19,Stojnicclupplt19,Stojniccluplargesc20,Stojniccluprephased20}. After the initial success that the CLuP exhibited in achieving the exact ML performance while maintaining excellent computational complexity related properties in MIMO ML detection in \cite{Stojnicclupint19,Stojnicclupcmpl19,Stojnicclupplt19}, one would naturally expect that a similar type of success can be achieved in other ML considerations. The results that we present here confirm that such an expectation is indeed reasonable. In particular, within the sparse regression context, the introduced CLuP mechanism indeed turns out to be able to \bl{\textbf{\emph{achieve the ideal ML performance}}}. Moreover, it can substantially outperform some of the most prominent earlier state of the art algorithmic concepts, among them even the variants of the famous LASSO and SOCP from \cite{StojnicPrDepSocp10,StojnicGenLasso10,StojnicGenSocp10}. Also, our recent results presented in \cite{Stojniccluplargesc20,Stojniccluprephased20} showed that the CLuP has excellent \bl{\textbf{\emph{large-scale}}} and the so-called \bl{\textbf{\emph{rephasing}}} abilities. Since such large-scale algorithmic features are possibly even more desirable within the sparse regression context we here also demonstrate that the basic CLuP ideas can be reformulated to enable solving with a relative ease the regression problems with \bl{\textbf{\emph{several thousands}}} of unknowns.

Top