“机器学习是近20多年兴起的一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。机器学习理论主要是设计和分析一些让 可以自动“ 学习”的算法。机器学习算法是一类从数据中自动分析获得规律，并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论，机器学习与统计推断学联系尤为密切，也被称为统计学习理论。算法设计方面，机器学习理论关注可以实现的，行之有效的学习算法。很多 推论问题属于 无程序可循难度，所以部分的机器学习研究是开发容易处理的近似算法。”

——中文维基百科

Humans are able to perform a myriad of sophisticated tasks by drawing upon skills acquired through prior experience. For autonomous agents to have this capability, they must be able to extract reusable skills from past experience that can be recombined in new ways for subsequent tasks. Furthermore, when controlling complex high-dimensional morphologies, such as humanoid bodies, tasks often require coordination of multiple skills simultaneously. Learning discrete primitives for every combination of skills quickly becomes prohibitive. Composable primitives that can be recombined to create a large variety of behaviors can be more suitable for modeling this combinatorial explosion. In this work, we propose multiplicative compositional policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. Our method factorizes an agent's skills into a collection of primitives, where multiple primitives can be activated simultaneously via multiplicative composition. This flexibility allows the primitives to be transferred and recombined to elicit new behaviors as necessary for novel tasks. We demonstrate that MCP is able to extract composable skills for highly complex simulated characters from pre-training tasks, such as motion imitation, and then reuse these skills to solve challenging continuous control tasks, such as dribbling a soccer ball to a goal, and picking up an object and transporting it to a target location.

Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the loss function generally only depends on the realization of a neural network, i.e. the function it computes. Studying the functional optimization problem over the space of realizations can open up completely new ways to understand neural network training. In particular, usual loss functions like the mean squared error are convex on sets of neural network realizations, which themselves are non-convex. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. In this paper we present pathologies which prevent inverse stability in general, and proceed to establish a restricted set of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function, which can be learned by optimization over unrestricted sets. While most of this paper focuses on shallow networks, none of methods used are, in principle, limited to shallow networks, and it should be possible to extend them to deep neural networks.

Motivated by the application of real-time pricing in e-commerce platforms, we consider the problem of revenue-maximization in a setting where the seller can leverage contextual information describing the customer's history and the product's type to predict her valuation of the product. However, her true valuation is unobservable to the seller, only binary outcome in the form of success-failure of a transaction is observed. Unlike in usual contextual bandit settings, the optimal price/arm given a covariate in our setting is sensitive to the detailed characteristics of the residual uncertainty distribution. We develop a semi-parametric model in which the residual distribution is non-parametric and provide the first algorithm which learns both regression parameters and residual distribution with $\tilde O(\sqrt{n})$ regret. We empirically test a scalable implementation of our algorithm and observe good performance.

We attempt to interpret how adversarially trained convolutional neural networks (AT-CNNs) recognize objects. We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. Surprisingly, we find that adversarial training alleviates the texture bias of standard CNNs when trained on object recognition tasks, and helps CNNs learn a more shape-biased representation. We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and images under different transformations. The comparison could visually show that the prediction of the two types of CNNs is sensitive to dramatically different types of features. Second, to achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred version of clean data, saturated images and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some light on why AT-CNNs are more robust than those normally trained ones and contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.

State-of-the-art deep learning methods have shown a remarkable capacity to model complex data domains, but struggle with geospatial data. In this paper, we introduce SpaceGAN, a novel generative model for geospatial domains that learns neighbourhood structures through spatial conditioning. We propose to enhance spatial representation beyond mere spatial coordinates, by conditioning each data point on feature vectors of its spatial neighbours, thus allowing for a more flexible representation of the spatial structure. To overcome issues of training convergence, we employ a metric capturing the loss in local spatial autocorrelation between real and generated data as stopping criterion for SpaceGAN parametrization. This way, we ensure that the generator produces synthetic samples faithful to the spatial patterns observed in the input. SpaceGAN is successfully applied for data augmentation and outperforms compared to other methods of synthetic spatial data generation. Finally, we propose an ensemble learning framework for the geospatial domain, taking augmented SpaceGAN samples as training data for a set of ensemble learners. We empirically show the superiority of this approach over conventional ensemble learning approaches and rivaling spatial data augmentation methods, using synthetic and real-world prediction tasks. Our findings suggest that SpaceGAN can be used as a tool for (1) artificially inflating sparse geospatial data and (2) improving generalization of geospatial models.

Estimating pairwise interaction effects, i.e., the difference between the joint effect and the sum of marginal effects of two input features, with uncertainty properly quantified, is centrally important in science applications. We propose a non-parametric probabilistic method for detecting interaction effects of unknown form. First, the relationship between the features and the output is modelled using a Bayesian neural network, leveraging on the representation capability of deep neural networks. Second, interaction effects and their uncertainty are estimated from the trained model. For the second step we propose a simple and intuitive global interaction measure: Expected Integrated Hessian (EIH), whose uncertainty can be estimated using the predictive uncertainty. Two important properties of the Bayesian EIH are: 1. interaction estimation error is upper bounded by the prediction error of the neural network, which ensures interaction detection can be improved by training a more accurate model; 2. uncertainty of the Bayesian EIH is well-calibrated provided the prediction uncertainty is calibrated, which is easier to test and guarantee. The method outperforms the available alternatives on simulated and real-world data, and we demonstrate its ability to detect interpretable interactions also between higher-level features (at deeper layers of the neural network).

Hyperbolic embeddings have recently gained attention in machine learning due to their ability to represent hierarchical data more accurately and succinctly than their Euclidean analogues. However, multi-relational knowledge graphs often exhibit multiple simultaneous hierarchies, which current hyperbolic models do not capture. To address this, we propose a model that embeds multi-relational graph data in the Poincar\'e ball model of hyperbolic space. Our Multi-Relational Poincar\'e model (MuRP) learns relation-specific parameters to transform entity embeddings by M\"obius matrix-vector multiplication and M\"obius addition. Experiments on the hierarchical WN18RR knowledge graph show that our multi-relational Poincar\'e embeddings outperform their Euclidean counterpart and existing embedding methods on the link prediction task, particularly at lower dimensionality.

We propose a Bayesian optimization method over sets, to minimize a black-box function that can take a set as single input. Because set inputs are permutation-invariant and variable-length, traditional Gaussian process-based Bayesian optimization strategies which assume vector inputs can fall short. To address this, we develop a Bayesian optimization method with \emph{set kernel} that is used to build surrogate functions. This kernel accumulates similarity over set elements to enforce permutation-invariance and permit sets of variable size, but this comes at a greater computational cost. To reduce this burden, we propose a more efficient probabilistic approximation which we prove is still positive definite and is an unbiased estimator of the true set kernel. Finally, we present several numerical experiments which demonstrate that our method outperforms other methods in various applications.

In this paper, we present a general framework to scale graph autoencoders (AE) and graph variational autoencoders (VAE). This framework leverages graph degeneracy concepts to train models only from a dense subset of nodes instead of using the entire graph. Together with a simple yet effective propagation mechanism, our approach significantly improves scalability and training speed while preserving performance. We evaluate and discuss our method on several variants of existing graph AE and VAE, providing the first application of these models to large graphs with up to millions of nodes and edges. We achieve empirically competitive results w.r.t. several popular scalable node embedding methods, which emphasizes the relevance of pursuing further research towards more scalable graph AE and VAE.

Recent years have witnessed growing interests in designing efficient neural networks and neural architecture search (NAS). Although remarkable efficiency and accuracy have been achieved, existing expert designed and NAS models neglect the fact that input instances are of varying complexity thus different amount of computation is required. Inference with a fixed model that processes all instances through the same transformations would waste plenty of computational resources. Therefore, customizing the model capacity in an instance-aware manner is highly demanded. To address this issue, we propose an Instance-aware Selective Branching Network-ISBNet, which supports efficient instance-level inference by selectively bypassing transformation branches of insignificant importance weight. These weights are determined dynamically by accompanying lightweight hypernetworks SelectionNets and further recalibrated by gumbel-softmax for sparse branch selection. Extensive experiments show that ISBNet achieves extremely efficient inference in terms of parameter size and FLOPs comparing to existing networks. For example, ISBNet takes only 8.03% parameters and 30.60% FLOPs of the state-of-the-art efficient network ShuffleNetV2 with comparable accuracy.

Performing knowledge transfer from a large teacher network to a smaller student is a popular task in modern deep learning applications. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. We propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. We achieve this by training an adversarial generator to search for images on which the student poorly matches the teacher, and then using them to train the student. Our resulting student closely approximates its teacher for simple datasets like SVHN, and on CIFAR10 we improve on the state-of-the-art for few-shot distillation (with 100 images per class), despite using no data. Finally, we also propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between our zero-shot student and the teacher, than between a student distilled with real data and the teacher. Code available at: https://github.com/polo5/ZeroShotKnowledgeTransfer