In Reinforcement Learning (RL), an agent explores the environment and collects trajectories into a memory buffer for later learning. However, the collected trajectories can easily be imbalanced with respect to the achieved goal states. Learning from imbalanced data is a well-studied problem in supervised learning, but it has not yet been thoroughly investigated in RL. To address this problem, we propose a novel Curiosity-Driven Prioritization (CDP) framework that encourages the agent to over-sample trajectories with rare achieved goal states. The CDP framework mimics the human learning process by focusing more on relatively uncommon events. We evaluate our method on the robotic environments provided by OpenAI Gym, which comprise six robot manipulation tasks. In our experiments, we combine CDP with Deep Deterministic Policy Gradient (DDPG), both with and without Hindsight Experience Replay (HER). The experimental results show that CDP improves both the performance and the sample efficiency of RL agents compared to state-of-the-art methods.
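To make the prioritization idea concrete, below is a minimal sketch in Python of a replay buffer that over-samples trajectories whose achieved goal states are rare. The class name `CuriosityPrioritizedBuffer`, the use of scikit-learn's `BayesianGaussianMixture` as the density estimator, and the rank-based sampling rule are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Sketch: density-based trajectory prioritization. We estimate the density of
# achieved goal states with a variational Gaussian mixture and over-sample
# trajectories whose achieved goals lie in low-density (rare) regions.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


class CuriosityPrioritizedBuffer:
    def __init__(self, n_components=3):
        # Each stored trajectory is a dict with an 'achieved_goals'
        # array of shape (T, goal_dim).
        self.trajectories = []
        self.density_model = BayesianGaussianMixture(n_components=n_components)

    def store(self, trajectory):
        self.trajectories.append(trajectory)

    def _refit_density(self):
        # Fit the mixture on all achieved goal states seen so far.
        # (In practice one would refit periodically, not on every sample call.)
        goals = np.concatenate([t['achieved_goals'] for t in self.trajectories])
        self.density_model.fit(goals)

    def sample(self, batch_size, rng=np.random):
        self._refit_density()
        # Mean log-density of each trajectory's achieved goals.
        log_dens = np.array([
            self.density_model.score_samples(t['achieved_goals']).mean()
            for t in self.trajectories
        ])
        # Rank-based prioritization: the rarest trajectory (lowest density)
        # receives the highest rank and thus the highest sampling probability.
        ranks = np.argsort(np.argsort(-log_dens)) + 1  # 1 = most common
        probs = ranks / ranks.sum()
        idx = rng.choice(len(self.trajectories), size=batch_size, p=probs)
        return [self.trajectories[i] for i in idx]
```

Rank-based probabilities are used here (as in prioritized experience replay) because they are insensitive to the scale of the density estimates, which can vary widely as the mixture is refit during training.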