Reinforcement learning (RL) methods typically use deep neural networks (DNNs) to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions via optimization problems taking the form of Quadratic Programs (QPs). We propose simple tools to promote structure in the QP, pushing it to resemble a linear MPC scheme. A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides ways for its analysis. The tools we propose allow the trade-off between the former and the latter to be adjusted continuously during learning. We illustrate the workings of our proposed method and the resulting structure on a point-mass task.
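To make the idea of a QP-based policy concrete, the following is a minimal sketch, not the paper's implementation: a policy defined as the minimizer of a parametric QP, u(s) = argmin_u 0.5 u^T H u + (F s + f)^T u. All matrices and the point-mass model below are illustrative assumptions; when H, F are built from model matrices (A, B) and cost weights (Q, R), the minimizer coincides with an unconstrained one-step linear MPC law, whereas freely learned H, F, f give an unstructured QP policy.

```python
import numpy as np

def qp_policy(H, F, f, s):
    """Minimizer of the unconstrained parametric QP: u = -H^{-1} (F s + f)."""
    return -np.linalg.solve(H, F @ s + f)

# Illustrative point-mass (double-integrator) model, one-step horizon for brevity.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)            # state cost weight (assumed)
R = 0.01 * np.eye(1)     # input cost weight (assumed)

# "MPC-structured" QP parameters derived from the model and costs:
H = R + B.T @ Q @ B
F = B.T @ Q @ A
f = np.zeros(1)

s = np.array([1.0, 1.0])      # positive position and velocity error
u = qp_policy(H, F, f, s)     # control pushing the mass back toward the origin
```

In the structured case the QP parameters are tied to (A, B, Q, R), which is what makes the resulting policy interpretable as an MPC law; relaxing those ties recovers the flexible, unstructured QP.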