Constrained Update Projection Approach to Safe Policy Optimization

From arXiv; accepted at NeurIPS 2022. arXiv admin note: substantial text overlap with arXiv:2202.07565; text overlap with arXiv:2002.06506 by other authors.

Safe reinforcement learning (RL) studies problems where an intelligent agent must not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys a rigorous safety guarantee. Central to the development of CUP are the newly proposed surrogate functions along with the performance bound. Compared to previous safe RL methods, CUP offers three benefits: 1) it generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) it unifies performance bounds, providing better understanding and interpretability of some existing algorithms; 3) it provides a non-convex implementation using only first-order optimizers, which does not require any strong approximation on the convexity of the objectives. To validate CUP, we compare it against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction. We have released the source code of CUP at https://github.com/RL-boxes/Safe-RL/tree/main/CUP.
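For readers unfamiliar with the generalized advantage estimator mentioned in the abstract, here is a minimal sketch of the standard GAE recursion in plain Python. It is not taken from the CUP code base: the function name `compute_gae`, its parameters, and the default values of `gamma` and `lam` are assumptions made for illustration, and the way CUP plugs these estimates into its surrogate functions is not shown. Applying the same routine to per-step costs to get a cost advantage is likewise an assumption about typical safe RL practice, not a statement about the authors' implementation.

```python
def compute_gae(signals, values, gamma=0.99, lam=0.95):
    """Standard GAE recursion for a single trajectory (illustrative sketch).

    signals: list of T per-step rewards (or, by assumption, per-step costs).
    values:  list of T + 1 value estimates; values[T] bootstraps the state
             after the last step (use 0.0 if the episode terminated).
    """
    if len(values) != len(signals) + 1:
        raise ValueError("values must have exactly one more entry than signals")

    advantages = [0.0] * len(signals)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(signals))):
        # One-step temporal-difference residual.
        delta = signals[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0]
    values = [0.0, 0.0, 0.0]
    print(compute_gae(rewards, values, gamma=1.0, lam=1.0))
```

Setting `lam=0.0` reduces each estimate to the one-step temporal-difference residual, while `lam=1.0` recovers the discounted sum of residuals over the rest of the trajectory; intermediate values trade bias against variance.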
