Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of `probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in the Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function, trained to evaluate NN policies, is also invariant to changes in policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public.
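The core mechanism can be illustrated with a minimal sketch. Here a toy value function is a fixed random linear map over the actions a policy produces in a few probing states (in the actual method both the probing states and the value network are trained); the policy improves by gradient ascent on the predicted return, estimated below by finite differences. All dimensions, the linear policy, and the helper `grad_fd` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 3 probing states (as few as used for Swimmer-v3),
# 8-D observations, 2-D actions.
N_PROBE, STATE_DIM, ACTION_DIM = 3, 8, 2

# Probing states; in the actual method these are learned jointly with V.
probing_states = rng.normal(size=(N_PROBE, STATE_DIM))

# Toy value function V: a fixed random linear map from the concatenated
# probing-state actions to a scalar predicted return (a trained NN in the paper).
w = rng.normal(size=N_PROBE * ACTION_DIM)

def predicted_return(theta):
    """Evaluate a linear tanh policy by the actions it takes in probing states."""
    actions = np.tanh(probing_states @ theta)   # shape (N_PROBE, ACTION_DIM)
    return float(w @ actions.ravel())

def grad_fd(f, x, eps=1e-4):
    """Finite-difference gradient of scalar f at x (illustrative helper)."""
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Policy improvement: ascend the value function's prediction. The policy
# changes only through the actions it emits in the probing states.
theta = rng.normal(size=(STATE_DIM, ACTION_DIM))
v0 = predicted_return(theta)
for _ in range(100):
    theta = theta + 0.05 * grad_fd(predicted_return, theta)
v1 = predicted_return(theta)
```

After the ascent loop, `v1` exceeds `v0`: the policy has improved as judged by the value function, without any environment interaction inside the loop.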