SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. When deployed with maximum entropy RL, this formulation leads to a safe, adversarially guided soft actor-critic framework called SAAC. In SAAC, the adversary aims to break the safety constraint, while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for each of these constraints. Then, in each of these variations, we show that the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
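To make two of the safety criteria named above concrete, the following is a minimal sketch of how a mean-variance risk measure and an empirical CVaR could be estimated from sampled per-episode costs. It is not taken from the paper: the function names, the weight `lam`, the level `alpha`, and the sample data are all assumptions chosen for illustration, and the quantile-based tail average is only one common empirical CVaR estimator.

```python
import numpy as np


def mean_variance_risk(costs, lam=1.0):
    """Mean-variance criterion E[C] + lam * Var[C] (lam is an assumed weight)."""
    costs = np.asarray(costs, dtype=float)
    # np.var uses the population variance (ddof=0) by default.
    return costs.mean() + lam * costs.var()


def empirical_cvar(costs, alpha=0.9):
    """Empirical CVaR at level alpha: mean of the costs at or above
    the alpha-quantile (the value-at-risk) of the sample."""
    costs = np.asarray(costs, dtype=float)
    value_at_risk = np.quantile(costs, alpha)
    tail = costs[costs >= value_at_risk]
    return tail.mean()


if __name__ == "__main__":
    # Hypothetical per-episode costs, for illustration only.
    sample_costs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    print("mean-variance:", mean_variance_risk(sample_costs, lam=1.0))
    print("CVaR(0.5):", empirical_cvar(sample_costs, alpha=0.5))
```

A risk-neutral criterion would look only at the mean cost; both measures here additionally penalize spread or the worst-case tail, which is the kind of quantity a risk-sensitive constraint would bound.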
