It has been empirically observed that, in deep neural networks, the solutions found by stochastic gradient descent from different random initializations can often be connected by a path with low loss. Recent works have shed light on this intriguing phenomenon by assuming either the over-parameterization of the network or the dropout stability of the solutions. In this paper, we reconcile these two views and present a novel condition for ensuring the connectivity of two arbitrary points in parameter space. This condition is provably milder than dropout stability, and it provides a connection between the problem of finding low-loss paths and the memorization capacity of neural nets. This last point brings about a trade-off between the quality of features at each layer and the over-parameterization of the network. As an extreme example of this trade-off, we show that (i) if subsets of features at each layer are linearly separable, then almost no over-parameterization is needed, and (ii) under generic assumptions on the features at each layer, it suffices that the last two hidden layers have $\Omega(\sqrt{N})$ neurons, $N$ being the number of samples. Finally, we provide experimental evidence demonstrating that the presented condition is satisfied in practical settings even when dropout stability does not hold.
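To make the connectivity phenomenon concrete, the sketch below trains two copies of a small network from different random initializations and probes the loss along the straight segment between the two solutions in parameter space. This is a minimal illustration, not the paper's construction: the toy data, the architecture, and the helper `loss_on_segment` are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn

# Synthetic binary classification data (hypothetical toy problem).
torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X[:, 0] > 0).long()

def make_mlp():
    # Small fully connected network; both solutions must share this
    # architecture so that their parameters can be interpolated.
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, steps=500):
    # Plain SGD training, standing in for the solutions found in practice.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

# Two solutions obtained from different random initializations.
torch.manual_seed(1)
model_a = train(make_mlp())
torch.manual_seed(2)
model_b = train(make_mlp())

@torch.no_grad()
def loss_on_segment(model_a, model_b, t):
    # Evaluate the loss at theta(t) = (1 - t) * theta_a + t * theta_b.
    probe = make_mlp()
    sa, sb, sp = model_a.state_dict(), model_b.state_dict(), probe.state_dict()
    for k in sp:
        sp[k] = (1 - t) * sa[k] + t * sb[k]
    probe.load_state_dict(sp)
    return nn.CrossEntropyLoss()(probe(X), y).item()

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"t={t:.2f}  loss={loss_on_segment(model_a, model_b, t):.4f}")
```

Note that the straight segment between two solutions generally exhibits a loss barrier; the low-loss paths studied in the paper are piecewise linear rather than a single segment, so probing the segment as above serves only as a diagnostic for the phenomenon, not as the connecting construction itself.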