Stochastic Gradient Descent (SGD) is the central workhorse for training modern CNNs. Although it delivers impressive empirical performance, it can be slow to converge. In this paper we explore a novel strategy for training a CNN using an alternation scheme that offers substantial speedups during training. We make the following contributions: (i) we replace the ReLU non-linearity within a CNN with positive hard-thresholding, (ii) we reinterpret this non-linearity as a binary state vector, making the entire CNN linear once the multi-layer support is known, and (iii) we demonstrate that under certain conditions a global optimum of the CNN can be found through local descent. We then employ a novel alternation strategy (between weights and support) for CNN training that leads to substantially faster convergence, enjoys attractive theoretical properties, and achieves state-of-the-art results on large-scale datasets (e.g. ImageNet) as well as other standard benchmarks.
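To make the binary-state reinterpretation concrete, the following is a minimal illustrative sketch (not the authors' code): the positive hard-threshold/ReLU output can be written as an elementwise product of the pre-activation with a binary support vector, so once that support is held fixed the layer output is linear in the weights. Function names and shapes here are assumptions for illustration only.

```python
import numpy as np

def relu_forward(W, x):
    """Standard non-linear layer: pre-activation followed by ReLU."""
    z = W @ x
    return np.maximum(z, 0.0)

def support_mask(W, x):
    """Binary state vector b encoding which units are active."""
    return (W @ x > 0.0).astype(np.float64)

def linear_forward(W, x, b):
    """With the support b held fixed, the layer is linear in W: y = b * (W x)."""
    return b * (W @ x)

# Quick check that the two views agree (hypothetical toy sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

b = support_mask(W, x)
assert np.allclose(relu_forward(W, x), linear_forward(W, x, b))
```

An alternation scheme of the kind described would then, loosely, alternate between (a) recomputing the supports b for all layers with the weights fixed and (b) updating the weights with the supports fixed, the latter being a linear problem in this view.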