Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms in real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. In practice, however, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which form the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse-reward tasks than prior approaches in the low-data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl
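To make the oversampling ingredient concrete, below is a minimal Python sketch, not the paper's implementation, of how demonstration transitions might be mixed into every training batch at a fixed ratio. The buffer contents, the `sample_batch` name, and the `demo_ratio` value are illustrative assumptions.

```python
import random

# Illustrative sketch only: oversample demonstration data when forming
# training batches, one of the three ingredients listed above. Buffers are
# plain lists of transitions here; the ratio and names are hypothetical.

def sample_batch(demo_buffer, interaction_buffer, batch_size, demo_ratio=0.25):
    """Draw a fixed fraction of each batch from the small demonstration buffer
    (with replacement) and the remainder from online interaction data."""
    n_demo = int(batch_size * demo_ratio)
    batch = random.choices(demo_buffer, k=n_demo)
    batch += random.choices(interaction_buffer, k=batch_size - n_demo)
    random.shuffle(batch)
    return batch

# Example usage with placeholder transitions.
demo_buffer = [("demo_obs", "demo_act", 1.0)] * 50       # e.g. transitions from a few demos
interaction_buffer = [("obs", "act", 0.0)] * 10_000      # online interaction data
batch = sample_batch(demo_buffer, interaction_buffer, batch_size=256)
print(len(batch))  # 256
```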