We propose a new objective, the counterfactual objective, that unifies existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared with the commonly used excursion objective, which can be misleading about how the target policy will perform once deployed, the counterfactual objective better predicts this deployment performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective, and use an emphatic approach to obtain an unbiased sample of this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms on MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms on prevailing deep RL benchmarks.
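As a rough sketch of the unification (the notation and exact form below are our assumptions from standard off-policy conventions, not quoted from the paper, and we omit any interest weighting): let $d_\mu$ and $d_\pi$ denote the stationary state distributions of the behavior policy $\mu$ and the target policy $\pi$, and let $v_\pi$ be the value function of $\pi$. A counterfactual objective of this kind can be written as
\[
J_{\hat\gamma}(\pi) \;\triangleq\; \sum_s d_{\hat\gamma}(s)\, v_\pi(s),
\qquad
d_{\hat\gamma}(s) \;\triangleq\; (1-\hat\gamma) \sum_{t=0}^{\infty} \hat\gamma^{\,t}\, \Pr\!\big(S_t = s \mid S_0 \sim d_\mu,\ \pi\big),
\]
where $\hat\gamma \in [0,1]$ controls how far the counterfactual rollout under $\pi$ extends from states sampled under $\mu$. Setting $\hat\gamma = 0$ gives $d_{\hat\gamma} = d_\mu$ and recovers the excursion objective $\sum_s d_\mu(s)\, v_\pi(s)$, while $\hat\gamma \to 1$ gives $d_{\hat\gamma} \to d_\pi$ (for an ergodic chain) and recovers the alternative-life objective $\sum_s d_\pi(s)\, v_\pi(s)$.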