Learning about many things can provide numerous benefits to a reinforcement learning system. For example, learning many auxiliary value functions, in addition to optimizing the environmental reward, appears to improve both exploration and representation learning. The question we tackle in this paper is how to sculpt the stream of experience---how to adapt the system's behaviour---to optimize the learning of a collection of value functions. A simple answer is to compute an intrinsic reward based on the statistics of each auxiliary learner, and use reinforcement learning to maximize that intrinsic reward. Unfortunately, implementing this simple idea has proven difficult, and it has been the focus of decades of study. It remains unclear which of the many possible measures of learning would work well in a parallel-learning setting where environmental reward is extremely sparse or absent. In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel-learning testbed. We discuss the interaction between reward and prediction learners and highlight the importance of introspective prediction learners: those that increase their rate of learning when progress is possible and decrease it when progress is not. We provide a comprehensive empirical comparison of 15 different intrinsic rewards, including well-known ideas from reinforcement learning and active learning. Our results highlight a simple but seemingly powerful principle: intrinsic rewards based on the amount of learning can generate useful behaviour, provided each individual learner is introspective.
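For concreteness, below is a minimal sketch of the idea described above; it is not the paper's actual testbed or any of its 15 specific rewards. A two-armed bandit chooses which of two prediction learners receives experience, each learner emits an intrinsic reward equal to its one-step reduction in squared prediction error, and each learner is made introspective with a crude sign-based step-size adaptation in the spirit of delta-bar-delta. All names and constants (`IntrospectiveLearner`, `signal`, the meta step-size, and so on) are illustrative assumptions, not the paper's.

```python
# A minimal sketch of the abstract's idea, assuming a two-armed testbed:
# arm 0 produces a learnable (slowly drifting) signal, arm 1 pure noise.
# Intrinsic reward = one-step reduction in squared prediction error;
# introspection = a crude sign-based step-size adaptation (delta-bar-delta
# in spirit). Names and constants here are illustrative, not the paper's.
import math
import random


class IntrospectiveLearner:
    """Scalar predictor whose step-size grows while successive errors
    agree in sign (progress is possible) and shrinks when they do not
    (the signal is noise at the current resolution)."""

    def __init__(self, alpha=0.1, meta=0.2):
        self.w = 0.0        # current prediction
        self.alpha = alpha  # adaptive step-size
        self.meta = meta    # meta step-size controlling adaptation
        self.trace = 0.0    # decayed average of recent errors

    def update(self, target):
        error = target - self.w
        # Introspection: grow alpha if this error agrees in sign with the
        # recent error trace, shrink it otherwise; keep alpha in bounds.
        self.alpha *= 1.0 + self.meta * (1.0 if error * self.trace > 0.0 else -1.0)
        self.alpha = min(max(self.alpha, 1e-4), 0.9)
        self.trace = 0.9 * self.trace + 0.1 * error
        old_sq = error ** 2
        self.w += self.alpha * error
        new_sq = (target - self.w) ** 2
        return old_sq - new_sq  # intrinsic reward: amount learned this step


def signal(t, arm):
    """Arm 0: a slowly drifting, trackable target; arm 1: unlearnable noise."""
    if arm == 0:
        return math.sin(t / 50.0) + random.gauss(0.0, 0.05)
    return random.gauss(0.0, 1.0)


def run(steps=5000, eps=0.1):
    learners = [IntrospectiveLearner(), IntrospectiveLearner()]
    value = [0.0, 0.0]  # recency-weighted estimate of each arm's intrinsic reward
    for t in range(steps):
        if random.random() < eps:
            arm = random.randrange(2)
        else:
            arm = 0 if value[0] >= value[1] else 1
        r = learners[arm].update(signal(t, arm))
        value[arm] += 0.05 * (r - value[arm])
    return value


if __name__ == "__main__":
    # The learnable arm should typically end with the higher estimated
    # intrinsic reward, so behaviour concentrates where learning happens.
    print(run())
```

The introspection step is what keeps this from failing in the way the abstract warns about: with a fixed step-size, the one-step error reduction stays positive even on pure noise (updating toward any target shrinks the current error), so behaviour would be drawn to the noisy arm. Shrinking the step-size when errors stop correlating drives that spurious reward toward zero, leaving the learnable arm as the one worth attending to.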