This paper considers the problem of learning a model in model-based reinforcement learning (MBRL). We examine how the planning module of an MBRL algorithm uses the model, and propose that the model learning module should incorporate the way the planner is going to use the model. This is in contrast to conventional model learning approaches, such as those based on maximum likelihood estimation, that learn a predictive model of the environment without explicitly considering the interaction between the model and the planner. We focus on policy gradient-type planning algorithms and derive new loss functions for model learning that incorporate how the planner uses the model. We call this approach Policy-Aware Model Learning (PAML). We theoretically analyze a generic model-based policy gradient algorithm and provide a convergence guarantee for the optimized policy. We also empirically evaluate PAML on several benchmark problems, showing promising results.