Objective Mismatch in Model-based Reinforcement Learning

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficient learning of control for continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework, which we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model with respect to the likelihood of the one-step-ahead prediction and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate and, conversely, that globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step-ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion of other potential directions of research for addressing this issue.
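
To make the mismatch concrete, the following is a minimal toy sketch, not the method from the paper. The abstract does not specify the re-weighting scheme, so the distance-based weights here, along with the names `weighted_nll`, `relevance_weights`, `task_state`, and `bandwidth`, the unit-variance Gaussian likelihood, and the 1-D dynamics, are all assumptions chosen for illustration. It compares a model with a small error everywhere against a model that is accurate only near the states a hypothetical task visits.

```python
import numpy as np

def gaussian_nll(pred, target, var=1.0):
    """Per-sample negative log-likelihood of `target` under N(pred, var)."""
    return 0.5 * (np.log(2.0 * np.pi * var) + (target - pred) ** 2 / var)

def weighted_nll(pred, target, weights):
    """Weighted mean of the per-sample NLL; uniform weights give the usual objective."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * gaussian_nll(pred, target)) / np.sum(weights))

def relevance_weights(states, task_state, bandwidth):
    """Hypothetical weighting: decays with distance from a task-relevant state."""
    return np.exp(-np.abs(states - task_state) / bandwidth)

# Toy 1-D data (assumed): states on a grid; the true next state shrinks toward 0.
states = np.linspace(-5.0, 5.0, 101)
next_states = 0.9 * states

# Model A: small constant bias everywhere (globally accurate).
pred_global = next_states + 0.2
# Model B: nearly exact near state 0, poor far from it (locally accurate).
pred_local = next_states + 0.1 * states ** 2

uniform = np.ones_like(states)
near_task = relevance_weights(states, task_state=0.0, bandwidth=0.5)

# Standard one-step objective: every transition counts equally.
uniform_global = weighted_nll(pred_global, next_states, uniform)
uniform_local = weighted_nll(pred_local, next_states, uniform)

# Re-weighted objective: transitions near the task-relevant state dominate.
task_global = weighted_nll(pred_global, next_states, near_task)
task_local = weighted_nll(pred_local, next_states, near_task)

print("uniform NLL  global model:", uniform_global, " local model:", uniform_local)
print("weighted NLL global model:", task_global, " local model:", task_local)
```

Under uniform weights the globally accurate model has the lower loss, while under the task-focused weights the locally accurate model does. The ranking of models by one-step likelihood therefore depends on which transitions the loss emphasizes, which is the kind of disagreement between likelihood and task relevance that the abstract describes.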
