We present foundations for using Model Predictive Control (MPC) as a differentiable policy class for reinforcement learning in continuous state and action spaces. This provides one way of leveraging and combining the advantages of model-free and model-based approaches. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning. Our experiments focus on imitation learning in the pendulum and cartpole domains, where we learn the cost and dynamics terms of an MPC policy class. We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.
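The core differentiation step can be illustrated on a simpler problem than full MPC: an equality-constrained quadratic program, whose KKT conditions form a linear system that can be implicitly differentiated. The sketch below is our illustrative stand-in for the paper's convex subproblem, not the paper's implementation; all function and variable names are our own, and gradients are checked against finite differences.

```python
import numpy as np

# Hedged sketch: differentiate through an equality-constrained QP
# via its KKT conditions (an illustrative stand-in for the convex
# MPC subproblem; names here are assumptions, not the paper's API).
#   min_x 0.5 x^T Q x + q^T x   s.t.  A x = b
# KKT system: [Q A^T; A 0] [x; lam] = [-q; b]

def solve_qp(Q, q, A, b):
    """Solve the QP by solving its KKT linear system."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], K

def grad_wrt_q(K, dl_dx, n):
    """Gradient of a downstream loss w.r.t. the linear cost term q.

    Implicit function theorem on K [x; lam] = [-q; b]:
    solve K^T v = [dl_dx; 0], then dl/dq = -v[:n].
    """
    rhs = np.concatenate([dl_dx, np.zeros(K.shape[0] - n)])
    v = np.linalg.solve(K.T, rhs)
    return -v[:n]

rng = np.random.default_rng(0)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)        # positive definite cost Hessian
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x, K = solve_qp(Q, q, A, b)
dl_dx = rng.standard_normal(n)     # gradient of a downstream loss at x*
g = grad_wrt_q(K, dl_dx, n)

# Finite-difference check of the analytic KKT gradient
eps = 1e-6
g_fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    xp, _ = solve_qp(Q, q + e, A, b)
    xm, _ = solve_qp(Q, q - e, A, b)
    g_fd[i] = dl_dx @ (xp - xm) / (2 * eps)

print(np.allclose(g, g_fd, atol=1e-5))
```

Because the gradient comes from one extra linear solve against the already-factorizable KKT matrix, backpropagating through the controller costs roughly as much as solving it once more, which is what makes treating the controller as a policy-network layer practical.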