We consider the optimization problem arising from fitting two-layer ReLU networks with $d$ inputs under the square loss, where labels are generated by a target network. Two infinite families of spurious minima have recently been identified: one whose loss vanishes as $d \to \infty$, and another whose loss remains bounded away from zero. The latter are nevertheless avoided by vanilla SGD, and are thus hidden, motivating the search for analytic properties distinguishing the two types. Perhaps surprisingly, the Hessian spectra of hidden and non-hidden minima agree up to terms of order $O(d^{-1/2})$, providing limited explanatory power. Consequently, our analysis of hidden minima proceeds instead via curves along which the loss is minimized or maximized. The main result is that arcs emanating from hidden minima differ characteristically in their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent from previous analyses.
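For concreteness, one standard way to write the objective described above is the population square loss of a student network measured against a fixed target (teacher) network; the number of hidden neurons $k$, the target weights $v_i$, and the Gaussian input distribution below are notational assumptions used only as a sketch, not quantities fixed by this abstract:
\[
\mathcal{L}(w_1,\dots,w_k) \;=\; \tfrac{1}{2}\,\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\!\left[\Big(\textstyle\sum_{i=1}^{k} \max(w_i^\top x,\,0) \;-\; \sum_{i=1}^{k} \max(v_i^\top x,\,0)\Big)^{2}\right].
\]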