Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed light on several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a \emph{Feller process}, which defines a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the \emph{Hausdorff dimension} of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of "capacity metric". We support our theory with experiments on deep neural networks illustrating that the proposed capacity metric accurately estimates the generalization error and that, unlike the existing capacity metrics in the literature, it does not necessarily grow with the number of parameters.
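To make the notion of a tail-index concrete, the following is a minimal sketch of the classical Hill estimator applied to synthetic Pareto samples. This is a standard textbook estimator swapped in for illustration; the abstract does not say which estimator the authors use, and the function name, sample sizes, and choice of `k` are all assumptions.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha from the k largest magnitudes."""
    mags = sorted((abs(s) for s in samples), reverse=True)
    threshold = mags[k]  # the (k+1)-th largest magnitude
    mean_log_excess = sum(math.log(m / threshold) for m in mags[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic data only (an assumption for illustration, not SGD iterates).
random.seed(0)
heavy = [random.paretovariate(1.2) for _ in range(20000)]  # heavier tail
light = [random.paretovariate(2.0) for _ in range(20000)]  # lighter tail

alpha_heavy = hill_tail_index(heavy, 2000)
alpha_light = hill_tail_index(light, 2000)
print(alpha_heavy, alpha_light)
```

A smaller estimated index corresponds to a heavier tail, which is the regime the abstract associates with better generalization.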

### Related content

In this work, our prime objective is to study the phenomena of quantum chaos and complexity in the machine learning dynamics of a Quantum Neural Network (QNN). A Parameterized Quantum Circuit (PQC) in the hybrid quantum-classical framework is introduced as a universal function approximator to perform optimization with Stochastic Gradient Descent (SGD). We employ a statistical and differential geometric approach to study the learning theory of QNNs. The evolution of parametrized unitary operators is correlated with the trajectory of parameters in the Diffusion metric. We establish the parametrized version of Quantum Complexity and Quantum Chaos in terms of physically relevant quantities, which are essential not only in determining the stability but also in providing a very significant lower bound on the generalization capability of the QNN. We explicitly prove that when the system executes limit cycles or oscillations in the phase space, the generalization capability of the QNN is maximized. Finally, we have determined the generalization capability bound on the variance of parameters of the QNN in a steady state condition using the Cauchy-Schwarz inequality.

Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log \, (1/\epsilon)$.
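As an illustration of random search with two-point gradient estimates, here is a minimal sketch on a hypothetical two-dimensional quadratic. The objective `toy_cost` is a stand-in and not the LQR cost studied in the paper; the step size, smoothing radius, and iteration count are likewise assumptions.

```python
import math
import random


def two_point_gradient(f, x, radius, rng):
    """Estimate grad f(x) from two evaluations along one random unit direction."""
    d = len(x)
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in u))
    u = [v / norm for v in u]  # direction uniform on the unit sphere
    plus = f([xi + radius * ui for xi, ui in zip(x, u)])
    minus = f([xi - radius * ui for xi, ui in zip(x, u)])
    slope = (plus - minus) / (2.0 * radius)  # directional derivative estimate
    # Scaling by d compensates for E[u u^T] = I/d, so the estimate matches
    # the gradient in expectation (exactly so for a quadratic objective).
    return [d * slope * ui for ui in u]


def random_search(f, x0, step, radius, iterations, seed=0):
    """Gradient descent driven only by function evaluations."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        g = two_point_gradient(f, x, radius, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def toy_cost(x):
    """Hypothetical stand-in objective with minimizer (1, -2); not the LQR cost."""
    return 0.5 * ((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)


x_final = random_search(toy_cost, [5.0, 5.0], step=0.3, radius=1e-3, iterations=300)
print(x_final)
```

Each iteration uses exactly two function evaluations, which is why the total number of evaluations is a natural measure of sample complexity in this setting.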

We present an intimate connection among the following fields: (a) distributed local algorithms: coming from the area of computer science, (b) finitary factors of iid processes: coming from the area of analysis of randomized processes, (c) descriptive combinatorics: coming from the area of combinatorics and measure theory. In particular, we study locally checkable labellings in grid graphs from all three perspectives. Most of our results are for the perspective (b), where we prove time hierarchy theorems akin to those known in the field (a) [Chang, Pettie FOCS 2017]. This approach, which borrows techniques from the fields (a) and (c), implies a number of results about possible complexities of finitary factor solutions. Among others, it answers three open questions of [Holroyd et al. Annals of Prob. 2017] and the more general question of [Brandt et al. PODC 2017], who asked for a formal connection between the fields (a) and (b). In general, we hope that our treatment will help to view all three perspectives as part of a common theory of locality, in which we follow the insightful paper of [Bernshteyn 2020+].

The estimation of probability density functions (PDF) using approximate maps is a fundamental building block in computational probability. We consider forward problems in uncertainty quantification: the inputs or the parameters of an otherwise deterministic model are random with a known distribution. The scalar quantity of interest is a fixed function of the parameters, and can therefore be considered as a random variable as well. Often, the quantity of interest map is not explicitly known, and so the computational problem is to find its "right" approximation (surrogate model). For the goal of approximating the {\em moments} of the quantity of interest, there is a developed body of research. One widely popular approach is generalized Polynomial Chaos (gPC) and its many variants, which approximate moments with spectral accuracy. But can the PDF of the quantity of interest be approximated with spectral accuracy? This is not directly implied by spectrally accurate moment estimation. In this paper, we prove convergence rates for PDFs using collocation and Galerkin gPC methods with Legendre polynomials in all dimensions. In particular, exponential convergence of the densities is guaranteed for analytic quantities of interest. In one dimension, we provide more refined results with stronger convergence rates, as well as an alternative proof strategy based on optimal-transport techniques.

Gaussian functions are commonly used in different fields, and many real signals can be modeled in this form. Research aiming to obtain a precise fit for these functions is therefore valuable. This manuscript introduces a new algorithm for estimating the full set of parameters of a Gaussian-shaped function. It is essentially a weighting method that starts from Caruana's method, with the weighting factors selected from a statistical viewpoint, based on an estimate of the confidence level of the samples. Tests designed for comparison with current similar methods have been conducted. The simulation results indicate good performance for the new method, mainly in precision and robustness.
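For context, the unweighted starting point, Caruana's method, can be sketched as follows: since the logarithm of a Gaussian is a parabola in x, a least-squares parabola fit to ln(y) gives the amplitude, center, and width in closed form. This sketch covers only that baseline, not the weighting scheme proposed in the manuscript; the function names and test data are assumptions.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def caruana_fit(xs, ys):
    """Fit y = A * exp(-(x - mu)^2 / (2 sigma^2)) via a parabola fit to ln(y)."""
    logs = [math.log(y) for y in ys]  # requires every y > 0
    # Normal equations for ln(y) ~ a + b*x + c*x^2.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(l * x ** k for x, l in zip(xs, logs)) for k in range(3)]
    a, b, c = solve3([[s[0], s[1], s[2]],
                      [s[1], s[2], s[3]],
                      [s[2], s[3], s[4]]], t)
    sigma = math.sqrt(-1.0 / (2.0 * c))   # c = -1 / (2 sigma^2)
    mu = -b / (2.0 * c)                   # b = mu / sigma^2
    amp = math.exp(a - b * b / (4.0 * c)) # a = ln(A) - mu^2 / (2 sigma^2)
    return amp, mu, sigma


# Noise-free synthetic samples (an assumption for illustration).
xs = [-2.0 + 0.5 * i for i in range(11)]
ys = [2.0 * math.exp(-(x - 0.5) ** 2 / (2.0 * 1.2 ** 2)) for x in xs]
amp, mu, sigma = caruana_fit(xs, ys)
print(amp, mu, sigma)
```

Taking logarithms amplifies noise in the low-amplitude tails, which is the usual motivation for weighting the samples.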

When building Deep Learning (DL) models, data scientists and software engineers manage the trade-off between their accuracy, or any other suitable success criteria, and their complexity. In an environment with high computational power, a common practice is making the models go deeper by designing more sophisticated architectures. However, in the context of mobile devices, which possess less computational power, keeping complexity under control is a must. In this paper, we study the performance of a system that integrates a DL model as a trade-off between the accuracy and the complexity. At the same time, we relate the complexity to the efficiency of the system. With this, we present a practical study that aims to explore the challenges met when optimizing the performance of DL models becomes a requirement. Concretely, we aim to identify: (i) the most concerning challenges when deploying DL-based software in mobile applications; and (ii) the path for optimizing the performance trade-off. We obtain results that verify many of the challenges identified in the related work, such as the availability of frameworks and the software-data dependency. We document our experience in facing the identified challenges and discuss possible solutions to them. Additionally, we implement a solution to the sustainability of the DL models when deployed in order to reduce the severity of other identified challenges. Moreover, we relate the performance trade-off to a newly defined challenge concerning the impact of complexity on the obtained accuracy. Finally, we discuss and motivate future work that aims to provide solutions to the more open challenges found.

Deep reinforcement learning (RL) algorithms have shown an impressive ability to learn complex control policies in high-dimensional environments. However, despite the ever-increasing performance on popular benchmarks such as the Arcade Learning Environment (ALE), policies learned by deep RL algorithms often struggle to generalize when evaluated in remarkably similar environments. In this paper, we assess the generalization capabilities of DQN, one of the most traditional deep RL algorithms in the field. We provide evidence suggesting that DQN overspecializes to the training environment. We comprehensively evaluate the impact of traditional regularization methods, $\ell_2$-regularization and dropout, and of reusing the learned representations to improve the generalization capabilities of DQN. We perform this study using different game modes of Atari 2600 games, a recently introduced modification for the ALE which supports slight variations of the Atari 2600 games traditionally used for benchmarking. Despite regularization being largely underutilized in deep RL, we show that it can, in fact, help DQN learn more general features. These features can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.

Graph Neural Networks (GNNs) for representation learning of graphs broadly follow a neighborhood aggregation framework, where the representation vector of a node is computed by recursively aggregating and transforming feature vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs in capturing different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.
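One simple way to see how the choice of neighborhood aggregator affects discriminative power is to compare sum, mean, and max on two neighborhoods that differ only in size. The sketch below is an illustrative assumption with scalar node features; the abstract itself does not state which aggregators are involved.

```python
def aggregate(neighbor_features, how):
    """Combine a multiset of scalar neighbor features into one value."""
    if how == "sum":
        return sum(neighbor_features)
    if how == "mean":
        return sum(neighbor_features) / len(neighbor_features)
    if how == "max":
        return max(neighbor_features)
    raise ValueError(f"unknown aggregator: {how}")


# Two hypothetical neighborhoods with identical features but different sizes.
two_neighbors = [1.0, 1.0]
three_neighbors = [1.0, 1.0, 1.0]

for how in ("sum", "mean", "max"):
    a = aggregate(two_neighbors, how)
    b = aggregate(three_neighbors, how)
    print(how, a, b, "distinguishes" if a != b else "cannot distinguish")
```

Mean and max collapse the two neighborhoods to the same value, while sum keeps them apart, so a model built on the former cannot separate nodes whose neighborhoods differ only in this way.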

Why deep neural networks (DNNs) capable of overfitting often generalize well in practice is a mystery in deep learning. Existing works indicate that this observation holds for both complicated real datasets and simple datasets of one-dimensional (1-d) functions. In this work, for natural images and low-frequency dominant 1-d functions, we empirically find that a DNN with common settings first quickly captures the dominant low-frequency components, and then relatively slowly captures the high-frequency ones. We call this phenomenon the Frequency Principle (F-Principle). The F-Principle can be observed over various DNN setups with different activation functions, layer structures and training algorithms in our experiments. The F-Principle can be used to understand (i) the behavior of DNN training in the information plane and (ii) why DNNs often generalize well despite their ability to overfit. The F-Principle can potentially provide insights into the general principle underlying DNN optimization and generalization for real datasets.

The robust and efficient recognition of visual relations in images is a hallmark of biological vision. Here, we argue that, despite recent progress in visual recognition, modern machine vision algorithms are severely limited in their ability to learn visual relations. Through controlled experiments, we demonstrate that visual-relation problems strain convolutional neural networks (CNNs). The networks eventually break altogether when rote memorization becomes impossible such as when the intra-class variability exceeds their capacity. We further show that another type of feedforward network, called a relational network (RN), which was shown to successfully solve seemingly difficult visual question answering (VQA) problems on the CLEVR datasets, suffers similar limitations. Motivated by the comparable success of biological vision, we argue that feedback mechanisms including working memory and attention are the key computational components underlying abstract visual reasoning.

Sayantan Choudhury, Ankan Dutta, Debisree Ray · March 16
Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, Mihailo R. Jovanović · March 15
Jan Grebík, Václav Rozhoň · March 15
Roger Creus Castanyer, Silverio Martínez-Fernández, Xavier Franch · March 11
January 30, 2019
Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka · October 1, 2018
Zhi-Qin J. Xu, Yaoyu Zhang, Yanyang Xiao · August 21, 2018
Matthew Ricci, Junkyung Kim, Thomas Serre · February 12, 2018
