Temporal Difference Learning for Model Predictive Control

Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and use a learned terminal value function to estimate long-term return, both of which are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance over prior work on both state and image-based continuous control tasks from DMControl and Meta-World. Code and video results are available at https://nicklashansen.github.io/td-mpc.
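To make the planning idea in the abstract concrete, the following is a minimal sketch of short-horizon trajectory optimization with a terminal value estimate. It is not the authors' implementation: a plain random-shooting search stands in for the planner, and the names `rollout_return`, `plan`, `td_target`, the `toy_*` functions, and the value `GAMMA = 0.99` are all hypothetical choices made for illustration. In the actual method the dynamics, reward, and value functions would be learned networks operating on a latent state.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value, not taken from the paper)


def rollout_return(z0, actions, dynamics, reward, value, gamma=GAMMA):
    """Score action sequences of shape (N, H, A) starting from latent state z0.

    Each sequence is scored by the discounted sum of predicted rewards over
    the H planned steps plus the discounted terminal value of the final state.
    """
    num_samples, horizon, _ = actions.shape
    z = np.repeat(z0[None, :], num_samples, axis=0)
    total = np.zeros(num_samples)
    discount = 1.0
    for t in range(horizon):
        a = actions[:, t]
        total += discount * reward(z, a)
        z = dynamics(z, a)
        discount *= gamma
    # The terminal value accounts for return beyond the planning horizon.
    return total + discount * value(z)


def plan(z0, dynamics, reward, value, horizon=5, num_samples=256,
         action_dim=1, rng=None):
    """Random-shooting planner (a simple stand-in, assumed for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    returns = rollout_return(z0, actions, dynamics, reward, value)
    best = int(np.argmax(returns))
    return actions[best], float(returns[best])


def td_target(r, z_next, value, gamma=GAMMA):
    """One-step temporal difference target for training the value function."""
    return r + gamma * value(z_next)


# Toy stand-ins for the learned components (hypothetical, hand-written).
def toy_dynamics(z, a):
    return z + 0.1 * a


def toy_reward(z, a):
    return -np.sum(z ** 2, axis=-1)


def toy_value(z):
    # Value of remaining at z forever under toy_reward.
    return -np.sum(z ** 2, axis=-1) / (1.0 - GAMMA)


z0 = np.array([1.0])
best_seq, best_ret = plan(z0, toy_dynamics, toy_reward, toy_value,
                          rng=np.random.default_rng(0))
first_action = best_seq[0]  # only the first action is executed, then replan
```

Executing only the first action and replanning at the next step is the usual receding-horizon pattern in model predictive control. The terminal value term is what lets the horizon stay short while still reflecting long-term return.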
