Explaining generalization in deep learning: progress and fundamental limits

This dissertation studies a fundamental open challenge in deep learning theory: why do deep networks generalize well even while being overparameterized, unregularized and fitting the training data to zero error? In the first part of the thesis, we will empirically study how training deep networks via stochastic gradient descent implicitly controls the networks' capacity. Subsequently, to show how this leads to better generalization, we will derive {\em data-dependent} {\em uniform-convergence-based} generalization bounds with improved dependencies on the parameter count. Uniform convergence has in fact been the most widely used tool in the deep learning literature, thanks to its simplicity and generality. Given its popularity, in this thesis, we will also take a step back to identify the fundamental limits of uniform convergence as a tool to explain generalization. In particular, we will show that in some example overparameterized settings, {\em any} uniform convergence bound will provide only a vacuous generalization bound. With this realization in mind, in the last part of the thesis, we will change course and introduce an {\em empirical} technique to estimate generalization using unlabeled data. Our technique does not rely on any notion of uniform-convergence-based complexity and is remarkably precise. We will theoretically show why our technique enjoys such precision. We will conclude by discussing how future work could explore novel ways to incorporate distributional assumptions in generalization bounds (such as in the form of unlabeled data) and explore other tools to derive bounds, perhaps by modifying uniform convergence or by developing completely new tools altogether.
