We present an analytical policy update rule that is independent of parametric function approximators. The update rule is suitable for optimizing general stochastic policies and carries a monotonic improvement guarantee. It is derived from a closed-form solution to trust-region optimization using the calculus of variations, building on a new theoretical result that tightens existing bounds for policy improvement with trust-region methods. The update rule establishes a connection between policy search methods and value function methods. Moreover, off-policy reinforcement learning algorithms can be derived from the update rule, since it does not require integration over on-policy states. In addition, the update rule extends immediately to cooperative multi-agent systems when policy updates are performed by one agent at a time.
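The abstract does not spell out the update rule itself, so the following is only a loose illustration of the general idea of a closed-form, per-state trust-region policy update that avoids integration over on-policy states. It uses a KL-regularized exponentiated-advantage reweighting, which is a standard closed-form solution to a trust-region-style objective and is not necessarily the rule derived in the paper; the function name, the temperature `eta`, and the toy numbers are hypothetical.

```python
import numpy as np

def closed_form_kl_update(pi_old, advantages, eta):
    """Illustrative per-state update: the maximizer of
    E_{a~pi}[A(s,a)] - eta * KL(pi || pi_old) over the probability simplex
    is pi_new(a) ∝ pi_old(a) * exp(A(s,a) / eta).
    Because the solution is evaluated state by state, no integral over
    the on-policy state distribution is needed (cf. the abstract's
    off-policy remark). This is a sketch, not the paper's exact rule.
    """
    logits = np.log(pi_old) + advantages / eta
    logits -= logits.max()          # shift for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()    # renormalize to a valid distribution

# Toy usage: one state with three discrete actions (hypothetical values).
pi_old = np.array([0.5, 0.3, 0.2])
adv = np.array([1.0, -0.5, 0.2])
print(closed_form_kl_update(pi_old, adv, eta=1.0))
```

A smaller `eta` trusts the advantage estimates more and moves the policy further from `pi_old`; a larger `eta` keeps the update conservative, which is the usual trade-off a trust-region constraint encodes.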