Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (``heavy-tailed'') regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.
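For intuition on the proposed estimator, the sketch below illustrates a geometric median-of-means (GMOM) aggregation of per-sample gradient estimates, with the geometric median computed by Weiszfeld's algorithm. This is only a minimal illustration of the general GMOM idea, not the paper's implementation; the function name `gmom`, the block count, and the iteration budget are illustrative assumptions.

```python
import numpy as np

def gmom(grads, n_blocks=5, n_iter=20, eps=1e-8):
    """Illustrative geometric median-of-means (GMOM) aggregator (not the paper's code).

    grads: (n_samples, dim) array of per-sample gradient estimates.
    Returns a robust aggregate gradient of shape (dim,).
    """
    # 1. Partition the samples into blocks and take each block's mean.
    blocks = np.array_split(grads, n_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])  # (n_blocks, dim)

    # 2. Geometric median of the block means via Weiszfeld's algorithm.
    z = block_means.mean(axis=0)
    for _ in range(n_iter):
        dists = np.linalg.norm(block_means - z, axis=1) + eps
        weights = 1.0 / dists
        z_new = (weights[:, None] * block_means).sum(axis=0) / weights.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z
```

The blocking step bounds the influence of any single heavy-tailed sample on its block mean, and the geometric median then discounts whole blocks that remain outlying, which is the sense in which such an estimator can stand in for clipping heuristics.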