If the trend of learned components eventually outperforming their hand-crafted counterparts continues, learned optimizers will eventually outperform hand-crafted optimizers such as SGD or Adam. Even if learned optimizers (L2Os) do outpace hand-crafted ones in practice, however, they are still not provably convergent and might fail out of distribution. These are the issues addressed here. Currently, learned optimizers frequently outperform generic hand-crafted optimizers (such as gradient descent) at the beginning of learning, but they generally plateau after some time, while the generic algorithms continue to make progress and often overtake the learned algorithm, much as Aesop's tortoise overtakes the hare. L2Os also still have difficulty generalizing out of distribution. Heaton et al. proposed Safeguarded L2O (GL2O), which takes a learned optimizer and safeguards it with a generic learning algorithm so that, by conditionally switching between the two, the resulting algorithm is provably convergent. We propose a new class of Safeguarded L2O, called Loss-Guarded L2O (LGL2O), which is both conceptually simpler and computationally less expensive. The guarding mechanism decides solely based on the expected future loss value of both optimizers. Furthermore, we give a theoretical proof of LGL2O's convergence guarantee and present empirical results comparing it to GL2O and other baselines, showing that it combines the best of both L2O and SGD and that, in practice, it converges much better than GL2O.
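To make the loss-based guarding idea concrete, the following is a minimal sketch and not the LGL2O algorithm itself. It assumes a toy quadratic loss, replaces the learned optimizer with a hypothetical stand-in (`mock_learned_step`, which takes fixed-length steps and therefore stalls near the minimum), and approximates the "expected future loss" by the exact loss at each optimizer's candidate point. All function names and constants are illustrative assumptions.

```python
import numpy as np

def loss(x):
    # Toy objective (assumption): f(x) = 0.5 * ||x||^2, minimized at x = 0.
    return 0.5 * float(np.dot(x, x))

def grad(x):
    # Gradient of the toy objective.
    return x

def gd_step(x, lr=0.1):
    # Generic hand-crafted optimizer: plain gradient descent.
    return x - lr * grad(x)

def mock_learned_step(x, step=2.0):
    # Hypothetical stand-in for an L2O (not a trained model): a fixed-length
    # step along the negative gradient. Fast far from the minimum, but it
    # overshoots and stops improving once ||x|| is smaller than the step.
    g = grad(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return x
    return x - step * g / norm

def loss_guarded_run(x0, n_steps=50):
    # At every step, compute both candidates and keep the one with lower loss.
    # Using the loss at the candidate as the guard signal is a simplifying
    # assumption standing in for the expected future loss.
    x = np.asarray(x0, dtype=float)
    history = [loss(x)]
    choices = []
    for _ in range(n_steps):
        cand_learned = mock_learned_step(x)
        cand_gd = gd_step(x)
        if loss(cand_learned) < loss(cand_gd):
            x = cand_learned
            choices.append("learned")
        else:
            x = cand_gd
            choices.append("gd")
        history.append(loss(x))
    return x, history, choices

if __name__ == "__main__":
    x, history, choices = loss_guarded_run([3.0, 4.0])
    print("first choices:", choices[:5])
    print("final loss:", history[-1])
```

In this toy setting the guard picks the stand-in learned step while it makes faster progress and falls back to gradient descent once the stand-in stalls, so the loss never increases from one step to the next.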