Model Predictive Control via On-Policy Imitation Learning

In this paper, we leverage recent advances in imitation learning, a topic of intense focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results and performance guarantees for data-driven Model Predictive Control (MPC) of constrained linear systems. In its simplest form, imitation learning tries to learn an expert policy by querying samples from the expert. Recent approaches to data-driven MPC have used the simplest variant, behavior cloning, to learn controllers that mimic the performance of MPC by sampling trajectories of the closed-loop MPC system online. Behavior cloning, however, is known to be data-inefficient and to suffer from distribution shift. As an alternative, we develop a variant of the forward training algorithm, an on-policy imitation learning method proposed by Ross et al. (2010). Our algorithm exploits the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to bound theoretically the number of online MPC trajectories needed to achieve optimal performance. We validate our results through simulations and show that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.
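To make the contrast with behavior cloning concrete, the following is a minimal Python sketch of the generic forward training idea: one policy is fit per time step, and the training states for step t are reached by rolling in with the already learned policies rather than with the expert. Everything specific here is an assumption for illustration and not the paper's setup: the double-integrator matrices `A` and `B`, the gain `K`, the saturated linear feedback used as a stand-in for the MPC expert, and the clipped affine least-squares learner in `fit_policy`.

```python
import numpy as np

# Assumed toy system: a discrete-time double integrator with |u| <= U_MAX.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
K = np.array([[0.4, 1.0]])  # hypothetical stabilizing gain
U_MAX = 1.0


def expert(x):
    """Stand-in for the MPC expert: a saturated linear feedback law."""
    return np.clip(-(K @ x), -U_MAX, U_MAX)


def fit_policy(X, U):
    """Affine least-squares fit of u on [x, 1], clipped to the input limits."""
    Phi = np.hstack([X, np.ones((len(X), 1))])
    theta, *_ = np.linalg.lstsq(Phi, U, rcond=None)
    return lambda x: np.clip(np.append(x, 1.0) @ theta, -U_MAX, U_MAX)


def forward_training(horizon, n_rollouts, rng):
    """Learn one policy per time step, rolling in with the learned policies."""
    policies = []
    for t in range(horizon):
        X, U = [], []
        for _ in range(n_rollouts):
            x = rng.uniform(-1.0, 1.0, size=2)  # sampled initial state
            for pi in policies:  # steps 0..t-1 use the learner, not the expert
                x = A @ x + B @ pi(x)
            X.append(x)
            U.append(expert(x))  # query the expert only at the visited state
        policies.append(fit_policy(np.array(X), np.array(U)))
    return policies


if __name__ == "__main__":
    learned = forward_training(horizon=5, n_rollouts=50, rng=np.random.default_rng(0))
    print(len(learned), learned[0](np.array([0.1, -0.2])))
```

Behavior cloning would instead collect all states along expert-driven trajectories and fit a single policy to them, so the learner is never trained on the states its own mistakes lead to. The resulting forward-trained controller is non-stationary: `learned[t]` is applied at time step `t`.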
