Reward learning from human preferences and demonstrations in Atari

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.

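
The abstract does not say how trajectory preferences are turned into a training signal for the reward model. As a minimal sketch, assume a Bradley-Terry style formulation, which is common in preference-based reward learning: the probability that a human prefers clip A over clip B is the logistic function of the difference in summed predicted rewards, and the model is trained with cross-entropy against the human label. The function names, the label encoding, and the choice of loss are illustrative assumptions, not details confirmed by the text above.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that clip A is preferred over clip B.

    Assumption: a Bradley-Terry model on the summed predicted rewards
    of each clip. rewards_a and rewards_b are per-step reward
    predictions, standing in for the outputs of a reward network.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and a human label.

    Assumed label encoding: 1.0 if the human preferred clip A,
    0.0 if clip B, 0.5 if the two clips were judged equal.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Usage: the loss is smaller when the predicted rewards agree with the label.
clip_a = [0.5, 0.5]  # hypothetical per-step predictions for clip A
clip_b = [0.0, 0.0]  # hypothetical per-step predictions for clip B
loss_agree = preference_loss(clip_a, clip_b, label=1.0)
loss_disagree = preference_loss(clip_a, clip_b, label=0.0)
```

In a full system the per-step rewards would come from the neural network, and this loss would be minimized by gradient descent over many labeled clip pairs. Noisy human labels, one of the effects the abstract says is studied, would appear here as labels that disagree with the true ordering of the clips.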