Reinforcement learning is a promising framework for solving control problems, but its use in practical settings is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards transfer to new environments better than language-conditioned policies. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate, on realistic, high-dimensional visual environments with natural language commands, that our model learns rewards that transfer to novel tasks and environments, whereas directly learning a language-conditioned policy leads to poor performance.
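To make the core idea concrete, below is a minimal sketch of a language-conditioned reward network: a deep network that maps an image observation and a natural-language command to a scalar reward, which would then be trained with an inverse reinforcement learning objective. This assumes PyTorch and a tokenized command; the encoder choices and layer sizes here are illustrative assumptions, not the exact LC-RL architecture.

```python
import torch
import torch.nn as nn

class LanguageConditionedReward(nn.Module):
    """Sketch of a reward network r(s, l) conditioned on an image
    observation s and a language command l. Architecture details are
    illustrative, not the paper's exact model."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        # Convolutional encoder for the visual observation (assumed RGB).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden_dim), nn.ReLU(),
        )
        # Recurrent encoder for the tokenized language command.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lang_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Joint head producing a scalar reward from both modalities.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, image, command_tokens):
        img_feat = self.image_encoder(image)                   # (B, hidden_dim)
        _, (h_n, _) = self.lang_encoder(self.embedding(command_tokens))
        lang_feat = h_n[-1]                                    # (B, hidden_dim)
        return self.head(torch.cat([img_feat, lang_feat], dim=-1)).squeeze(-1)
```

Because the reward, rather than the policy, is conditioned on language, the same network can score states in a new environment, where a policy's action mapping would no longer apply.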