Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations, which allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation, and that averaging their random but suitably permuted initialisations performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in the large learning rate regime, SGD seems to discover diverse modes.
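To make the setup concrete, the following is a minimal sketch, assuming a one-hidden-layer ReLU network and a simple weight-based matching of hidden units solved with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`). It illustrates permuting one network's hidden units to align it with another and then evaluating the loss along the linear interpolation path; it is an illustration of the general idea, not necessarily the exact permutation-finding algorithm used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical one-hidden-layer MLP parameters (W1, b1, W2) with
# hidden activations h = relu(W1 @ x + b1) and output W2 @ h.
def permute_hidden_units(params_a, params_b):
    """Permute the hidden units of network B to best match network A.

    Matching maximises the similarity of incoming and outgoing weight
    vectors via the Hungarian algorithm; this is one simple way to
    account for permutation invariance (an illustrative choice).
    """
    (W1a, b1a, W2a), (W1b, b1b, W2b) = params_a, params_b
    # Similarity between hidden unit i of A and hidden unit j of B.
    sim = W1a @ W1b.T + W2a.T @ W2b
    _, perm = linear_sum_assignment(-sim)  # maximise total similarity
    return W1b[perm], b1b[perm], W2b[:, perm]

def interpolate(params_a, params_b, alpha):
    """Linear interpolation (1 - alpha) * A + alpha * B, element-wise."""
    return tuple((1 - alpha) * pa + alpha * pb
                 for pa, pb in zip(params_a, params_b))

def mlp_loss(params, X, y):
    """Mean squared error of the one-hidden-layer ReLU network."""
    W1, b1, W2 = params
    h = np.maximum(W1 @ X.T + b1[:, None], 0.0)
    pred = (W2 @ h).T
    return float(np.mean((pred - y) ** 2))

# Toy usage: compare the loss along the interpolation path before and
# after permuting network B (random weights just to keep it runnable).
rng = np.random.default_rng(0)
d, h, o, n = 10, 32, 1, 256
X, y = rng.normal(size=(n, d)), rng.normal(size=(n, o))
make = lambda: (rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=(o, h)))
params_a, params_b = make(), make()
params_b_perm = permute_hidden_units(params_a, params_b)
for alpha in np.linspace(0.0, 1.0, 5):
    print(alpha,
          mlp_loss(interpolate(params_a, params_b, alpha), X, y),
          mlp_loss(interpolate(params_a, params_b_perm, alpha), X, y))
```

Note that the permutation leaves network B's function unchanged (rows of W1 and b1 and columns of W2 are reordered consistently), so only the loss at interpolated points between the two networks can differ before and after matching.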