Rethinking Value Function Learning for Generalization in Reinforcement Learning

We focus on the problem of training RL agents on multiple training environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that the value network in the multiple-environment setting is more challenging to optimize and more prone to overfitting the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization of the value network is required for better training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), which implicitly penalizes the value estimates by optimizing the value network less frequently, but with more training data, than the policy network. DCPG can be implemented using a shared network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency in the Procgen Benchmark.
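The update schedule described in the abstract (value network optimized less often than the policy network, but on more data) can be illustrated with a small sketch. This is not the authors' implementation: the function name `delayed_critic_schedule`, the parameter `value_freq`, and the assumption that one rollout is collected per iteration are all hypothetical, and the actual gradient steps are replaced by log entries so that only the scheduling logic is shown.

```python
from collections import deque


def delayed_critic_schedule(num_iterations, value_freq):
    """Return a log of (iteration, phase, rollout ids used) tuples.

    Assumptions (hypothetical, for illustration only): one rollout is
    collected per iteration; the policy is updated on that rollout alone,
    while the value function is updated only every `value_freq` iterations,
    on all rollouts buffered since its previous update.
    """
    buffer = deque(maxlen=value_freq)  # keeps the most recent rollouts
    log = []
    for it in range(1, num_iterations + 1):
        rollout_id = it  # stand-in for freshly collected trajectories
        buffer.append(rollout_id)

        # Policy phase: runs every iteration on the latest rollout only.
        log.append((it, "policy", [rollout_id]))

        # Value phase: delayed, and trained on more data than the policy.
        if it % value_freq == 0:
            log.append((it, "value", list(buffer)))
    return log


if __name__ == "__main__":
    for entry in delayed_critic_schedule(num_iterations=4, value_freq=2):
        print(entry)
```

With `value_freq=2`, the policy phase runs four times in four iterations while the value phase runs twice, each time on the two most recent rollouts. In a real agent the log entries would be replaced by optimizer steps on the corresponding losses.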

