Multi-goal reinforcement learning is widely used in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle both challenges with hindsight knowledge. However, HER and its previous variants still require millions of samples and substantial computation. In this paper, we propose \emph{Multi-step Hindsight Experience Replay} (MHER) based on $n$-step relabeling, which incorporates multi-step relabeled returns to improve sample efficiency. Despite the advantages of $n$-step relabeling, we show theoretically and experimentally that the off-policy $n$-step bias it introduces may lead to poor performance in many environments. To address this issue, we present two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER). MHER($\lambda$) exploits the $\lambda$-return, while MMHER benefits from model-based value expansion. Experimental results on numerous multi-goal robotic tasks show that our solutions successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER, with little additional computation beyond HER.
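For concreteness, a minimal sketch of the quantities the abstract refers to, assuming the standard truncated $n$-step target and $\lambda$-return forms (the symbols $g'$, $r$, $Q$, and $\pi$ below denote a relabeled hindsight goal, the goal-conditioned reward, the action-value function, and the policy; this notation is illustrative, not taken verbatim from the paper body):
\[
R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^{i}\, r(s_{t+i}, a_{t+i}, g') + \gamma^{n}\, Q\bigl(s_{t+n}, \pi(s_{t+n}, g'), g'\bigr),
\qquad
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{N-1} \lambda^{n-1} R_t^{(n)} + \lambda^{N-1} R_t^{(N)}.
\]
Under this reading, $n$-step relabeling recomputes the rewards $r(\cdot,\cdot,g')$ along a stored trajectory under the hindsight goal $g'$ before forming the $n$-step return, and MHER($\lambda$) averages the relabeled $n$-step returns with exponentially decaying weights, damping the off-policy bias contributed by longer horizons.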