Breaking the Deadly Triad with a Target Network

The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. Those algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
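To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that both projections are Euclidean projections onto origin-centered balls, that the first is applied to the main weights before averaging and the second to the averaged result, and that ridge regularization enters as a weight-decay term in a linear TD step. The abstract specifies none of these details; the function names, the radii `radius_w` and `radius_theta`, and the step sizes `alpha` and `beta` are hypothetical.

```python
import math


def project_to_ball(v, radius):
    """Euclidean projection of v onto the origin-centered ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius:
        return list(v)
    scale = radius / norm
    return [scale * x for x in v]


def target_update(theta, w, beta, radius_w, radius_theta):
    """Polyak-style step of the target weights theta toward the projected main
    weights w, followed by a projection of the result (assumed form)."""
    w_proj = project_to_ball(w, radius_w)
    mixed = [t + beta * (wp - t) for t, wp in zip(theta, w_proj)]
    return project_to_ball(mixed, radius_theta)


def ridge_td_step(w, theta, x, x_next, reward, gamma, alpha, eta):
    """One linear TD step that bootstraps from the target weights theta
    and shrinks w with ridge coefficient eta (assumed form)."""
    bootstrap = reward + gamma * sum(a * b for a, b in zip(x_next, theta))
    td_error = bootstrap - sum(a * b for a, b in zip(x, w))
    return [wi + alpha * (td_error * xi - eta * wi) for wi, xi in zip(w, x)]
```

In this sketch the bootstrapped target depends only on `theta`, which moves slowly and stays in a bounded set, so the main weights `w` chase a nearly stationary regression target. That is the intuition behind the conventional wisdom the abstract refers to; the convergence guarantees themselves depend on conditions given in the paper, not on this code.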

