Reinforcement Learning with Trajectory Feedback

The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. In practice, however, such frequent feedback is often unavailable. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as "trajectory feedback". Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
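The least-squares idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a small tabular problem in which each trajectory is summarized by a vector of state-action visit counts, the observed score is the noisy sum of rewards along the trajectory, and a ridge-regularized least-squares fit recovers the per-pair rewards. The function names, the regularization weight `reg`, and the uniformly random visits (standing in for a real policy and transition model) are all assumptions made for illustration.

```python
import numpy as np


def estimate_rewards(visit_counts, trajectory_scores, reg=1.0):
    """Ridge-regularized least-squares estimate of per-pair rewards.

    visit_counts: (num_trajectories, num_pairs) array; entry [k, j] is how
        often state-action pair j was visited in trajectory k.
    trajectory_scores: (num_trajectories,) array of observed trajectory sums.
    reg: hypothetical regularization weight, chosen only for illustration.
    """
    X = np.asarray(visit_counts, dtype=float)
    y = np.asarray(trajectory_scores, dtype=float)
    num_pairs = X.shape[1]
    gram = X.T @ X + reg * np.eye(num_pairs)
    # Solve (X^T X + reg * I) r = X^T y for the reward vector r.
    return np.linalg.solve(gram, X.T @ y)


def simulate(num_pairs=6, horizon=5, num_trajectories=2000,
             noise_std=0.1, seed=0):
    """Generate synthetic trajectory feedback (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    true_rewards = rng.uniform(0.0, 1.0, size=num_pairs)
    # Uniformly random visits stand in for a real policy and transition model.
    visits = rng.integers(0, num_pairs, size=(num_trajectories, horizon))
    X = np.zeros((num_trajectories, num_pairs))
    for k in range(num_trajectories):
        X[k] = np.bincount(visits[k], minlength=num_pairs)
    # Only the noisy sum of rewards over each trajectory is observed.
    y = X @ true_rewards + rng.normal(0.0, noise_std, size=num_trajectories)
    return true_rewards, X, y


if __name__ == "__main__":
    true_rewards, X, y = simulate()
    estimate = estimate_rewards(X, y)
    print("max abs error:", np.max(np.abs(estimate - true_rewards)))
```

Although no individual reward is ever observed in this sketch, the visit-count vectors vary enough across trajectories that the per-pair rewards become identifiable from the sums alone.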


