In partially observable reinforcement learning, offline training gives access to latent information that is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack a theoretical foundation and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods that use state-based critics, and expose fundamental issues that undermine the validity of a common variant and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant that exploits state information while remaining theoretically sound: it maintains the validity of the policy gradient theorem and introduces no bias and relatively low variance into the training process. An empirical evaluation on domains that exhibit significant partial observability confirms our analysis, demonstrating that the unbiased asymmetric actor-critic converges to better policies and/or converges faster than symmetric and biased asymmetric baselines.
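The variants discussed above differ only in what the critic is conditioned on: the history alone (symmetric), the state alone (the biased asymmetric variant), or both history and state (the unbiased variant). The following is a minimal sketch of this distinction, assuming a PyTorch-style setup; all module names, dimensions, the feature encodings, and the one-step loss are illustrative assumptions, not the paper's implementation.

```python
# Sketch (illustrative, not the paper's code): contrasting critic inputs for
# symmetric, biased asymmetric, and unbiased asymmetric actor-critic.
import torch
import torch.nn as nn

HIST_DIM, STATE_DIM, N_ACTIONS = 32, 8, 4  # hypothetical feature sizes

# The policy is always history-based: the agent never sees the state online.
actor = nn.Sequential(nn.Linear(HIST_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

# Symmetric critic V(h): conditioned only on history features.
critic_h = nn.Sequential(nn.Linear(HIST_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Biased asymmetric critic V(s): conditioned only on the latent state;
# this is the common variant whose validity the paper calls into question.
critic_s = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Unbiased asymmetric critic V(h, s): conditioned on history AND state,
# exploiting offline state access without biasing the policy gradient.
critic_hs = nn.Sequential(
    nn.Linear(HIST_DIM + STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

def actor_loss(hist_feat, state, action, td_target):
    """One-step actor-critic loss using the unbiased critic V(h, s)."""
    value = critic_hs(torch.cat([hist_feat, state], dim=-1))
    advantage = (td_target - value).detach()  # state is used by the critic only
    logits = actor(hist_feat)                 # the policy sees history only
    logp = torch.log_softmax(logits, dim=-1).gather(-1, action.unsqueeze(-1))
    return -(logp.squeeze(-1) * advantage).mean()
```

Note that the state enters only through the critic, which exists solely during offline training; the learned policy remains executable from the history alone.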