MBB: Model-Based Baseline for Efficient Reinforcement Learning

Model-free reinforcement learning (RL) can learn control policies for high-dimensional, complex robotic tasks, but tends to be data-inefficient. Model-based RL tends to be more data-efficient, but often struggles to learn a high-dimensional model that is accurate enough for policy improvement, which limits its use to simple models in restrictive domains. Optimal control generates solutions without collecting any data, assuming an accurate model of the system and environment is known, which is often the case in control theory applications. However, optimal control does not scale to problems with a high-dimensional state space. In this paper, we propose a novel approach that alleviates the data inefficiency of model-free RL in high-dimensional problems by warm-starting the learning process with a lower-dimensional model-based solution. In particular, we initialize a baseline function for the high-dimensional RL problem via supervision from a lower-dimensional value function, which can be obtained by solving a lower-dimensional problem with a known, approximate model using "classical" techniques such as value iteration or optimal control. Our approach thus implicitly exploits the model priors from a simplified problem space to facilitate policy learning in high-dimensional RL tasks. We demonstrate our approach on two representative robotic learning tasks and observe significant improvement in policy performance and learning efficiency. We also evaluate our method empirically on a third task.
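
To make the warm-starting idea concrete, here is a minimal sketch of how a baseline over a higher-dimensional state might be initialized by regressing onto a value function computed with value iteration on a lower-dimensional problem. Everything specific in it is an assumption made for illustration and is not taken from the paper: the 1-D chain environment, the `project` function that drops the velocity coordinate, the polynomial `features`, and the least-squares fit standing in for whatever function approximator is actually used.

```python
import numpy as np

def value_iteration(n_states=11, goal=10, gamma=0.9, tol=1e-8):
    """Value iteration on a hypothetical 1-D chain (the low-dimensional problem).

    Actions move one cell left or right; every step costs -1 until the goal.
    """
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            if s == goal:
                continue  # terminal state keeps value 0
            candidates = []
            for a in (-1, 1):
                s_next = min(max(s + a, 0), n_states - 1)
                candidates.append(-1.0 + gamma * V[s_next])
            V_new[s] = max(candidates)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def project(states, n_states=11):
    """Hypothetical projection: keep only the position, snapped to the grid."""
    return np.clip(np.rint(states[:, 0]), 0, n_states - 1).astype(int)

def features(states):
    """Polynomial features over the full state (position and velocity)."""
    x = states[:, 0] / 10.0
    v = states[:, 1]
    return np.stack([np.ones_like(x), x, x**2, x**3, v, v**2], axis=1)

rng = np.random.default_rng(0)
V_low = value_iteration()

# Sample full states (position, velocity) and label them with the
# low-dimensional value of their projection.
states = np.column_stack([rng.uniform(0, 10, 500), rng.normal(0, 1, 500)])
targets = V_low[project(states)]

# Supervised fit of the baseline; least squares is an assumed stand-in
# for training a neural-network baseline.
weights, *_ = np.linalg.lstsq(features(states), targets, rcond=None)

def baseline(states):
    """Warm-started baseline evaluated on full states."""
    return features(states) @ weights

mse_warm = np.mean((baseline(states) - targets) ** 2)
mse_zero = np.mean(targets ** 2)  # error of an uninformed all-zero baseline
print(f"warm-started MSE: {mse_warm:.4f}, zero-baseline MSE: {mse_zero:.4f}")
```

In a policy-gradient setting, a baseline initialized this way would then be refined with data from the full problem; the sketch only covers the supervised initialization step.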
