The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. In practice, however, defining such a reward signal is notoriously difficult, as humans are often unable to predict which behavior a given reward function will make optimal. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses an RL agent pretrained on only unlabeled, offline interactions, without task-specific supervision or labeled trajectories, to obtain zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.
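To make the three-step pipeline concrete, the sketch below shows one plausible instantiation in Python. The interfaces here (`video_model.generate`, a shared `encoder`, and an agent exposing `backward` embeddings and a `z`-conditioned `policy`, in the style of forward-backward representations) are hypothetical placeholders rather than the paper's actual API, and the projection step is illustrated as a simple nearest-neighbor match in a shared embedding space.

```python
import numpy as np

def imagine(instruction, video_model):
    # Step 1: hallucinate a rollout for the language instruction with a
    # text-to-video model; `generate` is a hypothetical interface.
    return video_model.generate(instruction)

def project(frames, dataset_obs, encoder):
    # Step 2: ground each imagined frame in the target domain by matching it
    # to its nearest offline observation in a shared embedding space.
    f = encoder(frames)                        # (T, d) imagined-frame embeddings
    o = encoder(dataset_obs)                   # (N, d) offline-observation embeddings
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    o = o / np.linalg.norm(o, axis=1, keepdims=True)
    nearest = np.argmax(f @ o.T, axis=1)       # cosine-similarity match
    return dataset_obs[nearest]

def imitate(projected_obs, agent):
    # Step 3: closed-form imitation. For a forward-backward style agent, the
    # task vector is the mean backward embedding of the demonstration states;
    # no gradient steps or test-time training are needed.
    z = agent.backward(projected_obs).mean(axis=0)
    z = z / np.linalg.norm(z)
    return lambda obs: agent.policy(obs, z)    # z-conditioned pretrained policy

def rl_zero(instruction, video_model, dataset_obs, encoder, agent):
    # End-to-end: imagine -> project -> imitate.
    frames = imagine(instruction, video_model)
    demo = project(frames, dataset_obs, encoder)
    return imitate(demo, agent)
```

The key design point reflected in `imitate` is that policy inference is a closed-form computation over the pretrained agent's representations, which is what makes the language-to-behavior step zero-shot at test time.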