CIM: Constrained Intrinsic Motivation for Sparse-Reward Continuous Control

Intrinsic motivation is a promising exploration technique for solving reinforcement learning tasks with sparse or absent extrinsic rewards. Implementing intrinsic motivation poses two technical challenges: 1) how to design a proper intrinsic objective that facilitates efficient exploration; and 2) how to combine the intrinsic objective with the extrinsic objective to help find better solutions. In the current literature, intrinsic objectives are all designed in a task-agnostic manner and are either combined with the extrinsic objective via simple addition or used on their own for reward-free pre-training. In this work, we show that these designs fail in typical sparse-reward continuous control tasks. To address the problem, we propose Constrained Intrinsic Motivation (CIM), which leverages readily attainable task priors to construct a constrained intrinsic objective and, at the same time, exploits the Lagrangian method to adaptively balance the intrinsic and extrinsic objectives within a simultaneous-maximization framework. We empirically show, on multiple sparse-reward continuous control tasks, that CIM achieves greatly improved performance and sample efficiency over state-of-the-art methods. Moreover, the key techniques of CIM can also be plugged into existing methods to boost their performance.
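The abstract does not spell out the CIM update rule, so the following is only a hypothetical illustration of how a Lagrange multiplier can adaptively weight an intrinsic term against an extrinsic one, in contrast to a fixed additive weight. It assumes a generic constrained form, maximize J_E subject to J_I >= target, with Lagrangian J_E + lam * (J_I - target) and lam >= 0. The class `LagrangianBalancer` and its parameters `target`, `lr`, and `lam` are illustrative names, not the paper's API.

```python
from dataclasses import dataclass


@dataclass
class LagrangianBalancer:
    """Hypothetical helper that adapts the weight on an intrinsic objective.

    Assumed formulation (illustrative, not taken from the CIM paper):
        maximize J_E  subject to  J_I >= target
    Lagrangian: L = J_E + lam * (J_I - target), with lam >= 0.
    """

    target: float      # assumed minimum acceptable level of the intrinsic objective
    lr: float = 0.1    # step size for the multiplier update
    lam: float = 1.0   # Lagrange multiplier, i.e. the current weight on the intrinsic term

    def update(self, intrinsic_estimate: float) -> float:
        # Projected gradient step minimizing L over lam (dL/dlam = J_I - target).
        # If the constraint is violated (estimate < target), lam grows and
        # exploration is emphasized; once it is satisfied, lam shrinks toward 0.
        step = self.lr * (intrinsic_estimate - self.target)
        self.lam = max(0.0, self.lam - step)
        return self.lam

    def combined_reward(self, r_ext: float, r_int: float) -> float:
        # Per-step reward the policy optimizer would maximize for the current lam.
        # The constant -lam * target is dropped because it does not depend on the policy.
        return r_ext + self.lam * r_int


if __name__ == "__main__":
    balancer = LagrangianBalancer(target=1.0, lr=0.5, lam=1.0)
    print(balancer.update(0.5))                # constraint violated: lam rises to 1.25
    print(balancer.combined_reward(1.0, 2.0))  # 1.0 + 1.25 * 2.0 = 3.5
    print(balancer.update(3.0))                # constraint satisfied: lam falls to 0.25
    print(balancer.update(3.0))                # clipped at 0.0
```

The design choice here is that the weight is driven by constraint violation rather than tuned by hand: a fixed coefficient in `r_ext + beta * r_int` keeps pushing exploration even after the agent has found the extrinsic reward, while the multiplier decays once the intrinsic target is met.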
