Imitation learning, followed by reinforcement learning algorithms, is a promising paradigm for solving complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results in cascading errors of the learned policy. We introduce a notion of conservatively-extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm, Value Iteration with Negative Sampling (VINS), that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose an algorithm that uses VINS to initialize a reinforcement learning algorithm, which significantly outperforms prior work in sample efficiency.
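To make the negative-sampling idea from the abstract concrete, the following is a minimal sketch, assuming a simple perturbation-based scheme: off-demonstration states are generated by adding noise to demonstration states, and the value function is trained to rank them below the states they were perturbed from, so that its gradient points back toward the demonstrations and enables self-correction. The network architecture, noise scale, and margin here are illustrative assumptions, not the paper's exact choices.

```python
# A hedged sketch of value learning with negative sampling; hyperparameters
# (noise_scale, margin, network size) are illustrative assumptions, not the
# exact VINS algorithm from the paper.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small MLP mapping a state to a scalar value estimate."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def negative_sampling_loss(value_net, demo_states, noise_scale=0.1, margin=0.1):
    """Push values of perturbed (off-demonstration) states below the values
    of the demonstration states they came from, so the value function
    extrapolates conservatively outside the demonstration distribution."""
    negatives = demo_states + noise_scale * torch.randn_like(demo_states)
    v_demo = value_net(demo_states)
    v_neg = value_net(negatives)
    # Hinge penalty: incurred when a perturbed state is valued too highly.
    return torch.relu(v_neg - (v_demo.detach() - margin)).mean()

# Usage sketch with stand-in data.
state_dim = 4
vnet = ValueNet(state_dim)
opt = torch.optim.Adam(vnet.parameters(), lr=1e-3)
demo = torch.randn(32, state_dim)  # placeholder for demonstration states
opt.zero_grad()
loss = negative_sampling_loss(vnet, demo)
loss.backward()
opt.step()
```

At execution time, a policy could locally ascend such a value function to steer drifting states back toward the demonstration distribution, which is the self-correction behavior the abstract refers to.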