Relay Hindsight Experience Replay: Self-Guided Continual Reinforcement Learning for Sequential Object Manipulation Tasks with Sparse Rewards

Exploration with sparse rewards remains a challenging research problem in reinforcement learning (RL). In sequential object manipulation tasks especially, the RL agent receives only negative rewards until it has completed all sub-tasks, which results in low exploration efficiency. To solve these tasks efficiently, we propose a novel self-guided continual RL framework, RelayHER (RHER). First, RHER decomposes a sequential task into new sub-tasks of increasing complexity and ensures that the simplest sub-task can be learned quickly by utilizing Hindsight Experience Replay (HER). Second, we design a multi-goal, multi-task network to learn these sub-tasks simultaneously. Finally, we propose a Self-Guided Exploration Strategy (SGES). With SGES, the policy learned for a simpler sub-task guides the agent to states that are helpful for learning more complex sub-tasks with HER. Through this self-guided exploration and relay policy learning, RHER can solve these sequential tasks efficiently, stage by stage. The experimental results show that RHER significantly outperforms vanilla HER in sample efficiency on five single-object and five complex multi-object manipulation tasks (e.g., Push, Insert, ObstaclePush, Stack, TStack). The proposed RHER has also been applied to learn a contact-rich push task on a physical robot from scratch, and the success rate reached 10/10 with only 250 episodes.
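
To make the two ideas above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a sparse reward of -1 until the achieved goal lies within a tolerance of the desired goal (0 afterwards), and it uses the simplest HER relabeling variant, in which every transition of an episode is relabeled with the goal actually achieved at the end. The transition layout, the tolerance value, and the helper `first_unfinished_stage` (a stand-in for choosing which sub-task policy should act next) are all hypothetical names introduced for illustration.

```python
import numpy as np


def sparse_reward(achieved_goal, desired_goal, tol=0.05):
    """Sparse reward: 0.0 once the goal is reached within tol, else -1.0."""
    diff = np.asarray(achieved_goal, dtype=float) - np.asarray(desired_goal, dtype=float)
    return 0.0 if np.linalg.norm(diff) <= tol else -1.0


def her_relabel_final(episode, tol=0.05):
    """Relabel each transition with the goal achieved at the end of the episode.

    `episode` is a list of dicts with keys "obs", "action", "goal" and
    "achieved_next" (an assumed layout, not taken from the paper).
    """
    new_goal = episode[-1]["achieved_next"]
    relabeled = []
    for t in episode:
        relabeled.append({
            "obs": t["obs"],
            "action": t["action"],
            "goal": new_goal,
            "achieved_next": t["achieved_next"],
            # Recompute the reward against the substituted goal.
            "reward": sparse_reward(t["achieved_next"], new_goal, tol),
        })
    return relabeled


def first_unfinished_stage(stage_done):
    """Index of the first sub-task not yet achieved (hypothetical helper).

    Stages before this index are treated as solved, so their learned
    policies can bring the agent to states useful for the current stage.
    """
    for i, done in enumerate(stage_done):
        if not done:
            return i
    return len(stage_done) - 1


# A failed two-step episode: the object never reaches the goal at x = 1.0.
goal = np.array([1.0, 0.0, 0.0])
episode = [
    {"obs": np.zeros(3), "action": np.zeros(2), "goal": goal,
     "achieved_next": np.array([0.1, 0.0, 0.0])},
    {"obs": np.zeros(3), "action": np.zeros(2), "goal": goal,
     "achieved_next": np.array([0.2, 0.0, 0.0])},
]
relabeled = her_relabel_final(episode)
```

Under the original goal both transitions would earn -1, whereas after relabeling the last transition earns 0, which gives the learner a non-trivial signal even from a failed episode. The stage helper only illustrates the relay idea; how RHER actually switches between sub-task policies is not specified in the abstract.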
