Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning

Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recent approaches leverage intrinsic rewards to improve exploration, such as novelty-based and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computationally efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. To estimate the divergence efficiently, a k-nearest neighbor estimator is used together with a randomly initialized state encoder. REVD is evaluated on Atari games and PyBullet Robotics Environments. Extensive experiments demonstrate that REVD significantly improves the sample efficiency of reinforcement learning algorithms and outperforms the benchmarking methods.
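The abstract names the ingredients (a fixed random encoder, a k-nearest neighbor estimator, a discrepancy between episodes) but not the reward formula itself. The sketch below is therefore only an illustration under stated assumptions: the linear encoder, the function names, the parameters `k` and `alpha`, and the distance-ratio form of the reward are all hypothetical and loosely modeled on k-NN density-ratio estimators, not taken from the paper.

```python
import numpy as np


def make_random_encoder(obs_dim, latent_dim, seed=0):
    """Return a fixed, randomly initialized linear encoder (never trained).

    Assumption: a linear map stands in for whatever encoder the paper uses.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(scale=1.0 / np.sqrt(obs_dim), size=(obs_dim, latent_dim))
    return lambda obs: np.asarray(obs, dtype=np.float64) @ weights


def kth_nn_distance(queries, references, k, exclude_self=False):
    """Euclidean distance from each query row to its k-th nearest reference row."""
    diffs = queries[:, None, :] - references[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)
    # If queries and references are the same set, column 0 is the zero
    # self-distance, so the k-th true neighbor sits at column k.
    idx = k if exclude_self else k - 1
    return dists[:, idx]


def visitation_discrepancy_rewards(prev_episode, curr_episode, encoder,
                                   k=3, alpha=0.5, eps=1e-8):
    """Hypothetical per-state intrinsic reward for the current episode.

    A state scores high when it is far from the previous episode's states
    relative to how tightly the current episode surrounds it. The exponent
    (1 - alpha) mirrors the form of k-NN Renyi divergence estimators, but
    this exact expression is an assumption, not the paper's formula.
    """
    e_prev = encoder(prev_episode)
    e_curr = encoder(curr_episode)
    dist_to_prev = kth_nn_distance(e_curr, e_prev, k)
    dist_within = kth_nn_distance(e_curr, e_curr, k, exclude_self=True)
    ratio = (dist_to_prev + eps) / (dist_within + eps)
    return ratio ** (1.0 - alpha)
```

Each episode is passed as an array of shape `(num_states, obs_dim)` holding more than `k` states. Because the encoder is fixed and the estimator needs only pairwise distances, nothing in this sketch is trained, which is the source of the low computational cost the abstract emphasizes.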
