In many applications where collecting data is expensive, for example neuroscience or medical imaging, the sample size is typically small compared to the feature dimension. In this setting it is challenging to train expressive, non-linear models without overfitting. These datasets call for intelligent regularization that exploits known structure, such as correlations between the features arising from the measurement device. However, existing structured regularizers need specially crafted solvers, which are difficult to apply to complex models. We propose a new regularizer specifically designed to leverage structure in the data in a way that can be applied efficiently to complex models. Our approach relies on feature grouping, using a fast clustering algorithm inside a stochastic gradient descent loop: given a family of feature groupings that capture feature covariations, we randomly select one of these groupings at each iteration. We show that this approach amounts to enforcing a denoising regularizer on the solution. The method is easy to implement in many model architectures, such as fully connected neural networks, and has a linear computational cost. We apply this regularizer to a real-world fMRI dataset and the Olivetti Faces dataset. Experiments on both datasets demonstrate that the proposed approach produces models that generalize better than those trained with conventional regularizers, and also improves convergence speed.
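The core operation described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it assumes each feature grouping is a label vector assigning every feature to a group, and that the "denoising" projection replaces each feature with the mean of its group. At every SGD step one grouping from the precomputed family would be drawn at random and applied to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_groupings(n_features, n_groups, n_clusterings):
    # Stand-in for the paper's fast clustering step: here groups are
    # assigned uniformly at random; the actual method derives them from
    # feature covariations in the data.
    return [rng.integers(0, n_groups, size=n_features)
            for _ in range(n_clusterings)]

def group_project(x, labels, n_groups):
    # Average the features within each group, then broadcast the group
    # mean back to every member: a piecewise-constant denoising
    # projection of the input vector x.
    counts = np.bincount(labels, minlength=n_groups)
    sums = np.bincount(labels, weights=x, minlength=n_groups)
    means = sums / np.maximum(counts, 1)
    return means[labels]

# Toy usage: draw one grouping at random, as would happen per SGD step.
n_features, n_groups = 100, 10
groupings = make_groupings(n_features, n_groups, n_clusterings=5)
x = rng.normal(size=n_features)
labels = groupings[rng.integers(len(groupings))]
x_denoised = group_project(x, labels, n_groups)
```

Because the projection averages within groups, applying it twice with the same grouping changes nothing, which is what makes it act as a denoiser rather than an arbitrary perturbation.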