We consider non-convex stochastic optimization problems where the objective functions have super-linearly growing and discontinuous stochastic gradients. In this setting, we provide a non-asymptotic analysis of the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2020). In particular, we establish non-asymptotic error bounds for TUSLA in Wasserstein-1 and Wasserstein-2 distances. The Wasserstein-2 bound enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments for this example support our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that TUSLA can solve the optimization problem involving neural networks with ReLU activation function. In addition, we provide simulation results for synthetic examples where popular algorithms, e.g., ADAM, AMSGrad, RMSProp, and (vanilla) stochastic gradient descent (SGD), may fail to find the minimizer of the objective function due to the super-linear growth and the discontinuity of the corresponding stochastic gradient, while TUSLA converges rapidly to the optimal solution. Moreover, we provide an empirical comparison of the performance of TUSLA with popular stochastic optimizers on real-world datasets, and we investigate the effect of the key hyperparameters of TUSLA on its performance.
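Title: Non-asymptotic estimates for the TUSLA algorithm for non-convex learning, with applications to neural networks with ReLU activation function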
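To make the taming idea concrete, the following is a minimal sketch of a TUSLA-style iteration, not the authors' implementation. It assumes the update takes the form θ_{n+1} = θ_n − λ H(θ_n, X_{n+1}) / (1 + √λ |θ_n|^{2r}) + √(2λ/β) ξ_{n+1}, with step size λ, inverse temperature β, taming exponent r, and standard Gaussian noise ξ; this form should be checked against Lovas et al. (2020). The function names and the toy objective u(θ) = θ⁴/4 + θ²/2 with additive gradient noise are hypothetical and chosen only for illustration.

```python
import numpy as np


def tusla_step(theta, stoch_grad, lam, beta, r, rng):
    """One tamed Langevin step (assumed form, see the lead-in)."""
    # Taming factor: grows like |theta|^(2r), so the effective drift
    # grows at most linearly even if the gradient is super-linear.
    taming = 1.0 + np.sqrt(lam) * np.linalg.norm(theta) ** (2 * r)
    noise = rng.standard_normal(theta.shape)
    return theta - lam * stoch_grad / taming + np.sqrt(2.0 * lam / beta) * noise


def run_tusla(grad_fn, theta0, lam, beta, r, n_steps, seed=0):
    """Iterate tusla_step; grad_fn(theta, rng) returns a stochastic gradient."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = tusla_step(theta, grad_fn(theta, rng), lam, beta, r, rng)
    return theta


def toy_noisy_grad(theta, rng):
    """Hypothetical example: gradient of theta^4/4 + theta^2/2 plus N(0, 1) noise."""
    return theta**3 + theta + rng.standard_normal(theta.shape)


# Start far from the minimizer at 0, where the cubic gradient is very large.
theta_final = run_tusla(
    toy_noisy_grad, np.array([50.0]), lam=0.01, beta=1e8, r=1, n_steps=5000
)
print(theta_final)
```

In this toy setting, a plain SGD step from the same starting point with the same step size would overshoot by more than a thousand units, whereas the tamed drift moves the iterate by only a few units per step while it is far from the origin. The choice r=1 reflects the cubic growth of the toy gradient and is an assumption of this sketch.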