We introduce Self-supervised Online Reward Shaping (SORS), which aims to improve the sample efficiency of any RL algorithm in sparse-reward environments by automatically densifying rewards. The proposed framework alternates between classification-based reward inference and policy update steps -- the original sparse reward provides a self-supervisory signal for reward inference by ranking the trajectories that the agent observes, while the policy update is performed with the newly inferred, typically dense reward function. We present theory showing that, under certain conditions, replacing the original sparse reward with the inferred dense reward does not change the optimal policy of the original MDP, while potentially speeding up learning significantly. Experimental results on several sparse-reward domains demonstrate that the proposed algorithm is not only significantly more sample efficient than a standard RL baseline using sparse rewards, but at times also matches the sample efficiency obtained with hand-designed dense reward functions.
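The sketch below illustrates the alternation described above under some assumptions of our own: a PyTorch reward network trained with a pairwise ranking (cross-entropy) loss over trajectories ordered by their original sparse returns, and a generic RL agent. Names such as `RewardNet`, `collect_trajectory`, `sample_pair`, `relabel`, and `agent.update` are hypothetical placeholders, not components specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Learned dense reward: maps each state to a scalar per-step reward."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):               # states: (T, state_dim)
        return self.net(states).squeeze(-1)  # per-step rewards: (T,)

def ranking_loss(reward_net, traj_a, traj_b):
    """Classification-based reward inference: the trajectory with the higher
    original sparse return should get the higher cumulative learned reward."""
    ret_a = reward_net(traj_a["states"]).sum()
    ret_b = reward_net(traj_b["states"]).sum()
    # Label 1 means traj_a is ranked above traj_b by the sparse return.
    label = torch.tensor(1.0 if traj_a["sparse_return"] > traj_b["sparse_return"] else 0.0)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, label)

def sors_loop(agent, env, reward_net, reward_opt, iterations, pairs_per_iter):
    buffer = []  # trajectories labeled with their original sparse returns
    for _ in range(iterations):
        traj = collect_trajectory(agent, env)    # hypothetical helper
        buffer.append(traj)
        # Step 1: self-supervised reward inference from sparse-return rankings.
        for _ in range(pairs_per_iter):
            a, b = sample_pair(buffer)           # hypothetical helper
            reward_opt.zero_grad()
            ranking_loss(reward_net, a, b).backward()
            reward_opt.step()
        # Step 2: policy update with the newly inferred dense reward.
        agent.update(relabel(traj, reward_net))  # hypothetical helper
```

The cross-entropy over the difference of cumulative learned rewards is one common way to turn return-based trajectory rankings into a reward-learning signal; the policy update step can wrap any off-the-shelf RL algorithm, with the learned dense reward substituted for the sparse one when relabeling transitions.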