Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
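The "simple approach" the abstract alludes to can be made concrete with one identity: the hyperbolic discount function $d_k(t) = 1/(1+kt)$ is an integral over exponential discount functions, since $\int_0^1 \gamma^{kt}\, d\gamma = \frac{1}{1+kt}$. A hyperbolically discounted value is therefore an integral over standard exponentially discounted values, $V_{\text{hyper}}(s) = \int_0^1 V_{\gamma^k}(s)\, d\gamma$, and a Riemann sum over a finite grid of discount factors approximates it, with each component value function learned by ordinary temporal-difference updates. Below is a minimal Python sketch of that Riemann-sum combination; the grid, weights, and function names are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def hyperbolic_q(q_heads, gamma_grid):
    """Riemann-sum approximation of a hyperbolically discounted Q-value.

    q_heads:    per-discount estimates, q_heads[i] = Q_{Gamma_i}(s, a), each
                trained with ordinary TD learning at discount Gamma_i
                (Gamma_i = gamma_grid[i] ** k; k is folded into the grid here).
    gamma_grid: increasing points gamma_i in (0, 1] partitioning [0, 1] for
                the integral 1/(1 + k*t) = integral_0^1 gamma^(k*t) d(gamma).
    """
    # Width of each slice of [0, 1]; the first slice starts at 0.
    widths = np.diff(np.concatenate(([0.0], gamma_grid)))
    return float(np.sum(widths * np.asarray(q_heads)))

# Sanity check with a single reward of 1 arriving at delay T (k = 1):
# each exponential head values it at gamma_i ** T, and the hyperbolic
# value should be d(T) = 1 / (1 + T).
T = 10
gammas = np.linspace(1e-3, 1.0, 10_000)  # dense grid over (0, 1]
q_heads = gammas ** T                    # toy Q_{gamma_i} for this reward
print(hyperbolic_q(q_heads, gammas))     # ~0.091, close to 1/11
```

In an agent, each head can share a network body and differ only in the discount used in its TD target, which is also what makes the multi-horizon heads cheap to add as the auxiliary task mentioned above.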