We propose a new objective, the counterfactual objective, that unifies existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared with the commonly used excursion objective, which can be misleading about how the target policy will perform once deployed, the counterfactual objective better predicts this deployment performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective, and use an emphatic approach to obtain an unbiased sample of this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms on MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms on prevailing deep RL benchmarks.
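As a rough sketch of the unification (the notation and exact form below are our assumptions from standard off-policy conventions, not quoted from the paper, and we omit any interest weighting): let $d_\mu$ and $d_\pi$ denote the stationary state distributions of the behavior policy $\mu$ and the target policy $\pi$, and let $v_\pi$ be the value function of $\pi$. A counterfactual objective of this kind can be written as
\[
J_{\hat\gamma}(\pi) \;\triangleq\; \sum_s d_{\hat\gamma}(s)\, v_\pi(s),
\qquad
d_{\hat\gamma}(s) \;\triangleq\; (1-\hat\gamma) \sum_{t=0}^{\infty} \hat\gamma^{\,t}\, \Pr\!\big(S_t = s \mid S_0 \sim d_\mu,\ \pi\big),
\]
where $\hat\gamma \in [0,1]$ controls how far the counterfactual rollout under $\pi$ extends from states sampled under $\mu$. Setting $\hat\gamma = 0$ gives $d_{\hat\gamma} = d_\mu$ and recovers the excursion objective $\sum_s d_\mu(s)\, v_\pi(s)$, while $\hat\gamma \to 1$ gives $d_{\hat\gamma} \to d_\pi$ (for an ergodic chain) and recovers the alternative-life objective $\sum_s d_\pi(s)\, v_\pi(s)$.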