Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals, and the low-level policy learns to achieve those goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.
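To make the manager/worker decomposition concrete, here is a minimal Python sketch of a high-level policy that selects latent goals at a fixed interval and a low-level policy that acts toward them. This is only an illustration, not the paper's implementation: the class names, the goal-refresh interval `GOAL_EVERY`, the latent dimensionality, and the distance-based goal reward are all assumptions made for this sketch; in Director both policies are trained from imagined rollouts inside a learned world model rather than hand-coded.

```python
# Illustrative sketch of a two-level goal-conditioned hierarchy (assumed names).
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8    # size of the world model's latent state (assumed)
GOAL_EVERY = 16   # manager picks a new latent goal every K steps (assumed)

class Manager:
    """High-level policy: outputs a latent goal for the worker to pursue."""
    def select_goal(self, latent_state):
        # Placeholder: a trained manager would choose goals that maximize
        # task and exploration rewards; here we just perturb the state.
        return latent_state + rng.normal(scale=0.5, size=LATENT_DIM)

class Worker:
    """Low-level policy: outputs primitive actions to reach the latent goal."""
    def act(self, latent_state, goal):
        # Placeholder: step toward the goal. A trained worker would be
        # rewarded for similarity between its latent state and the goal.
        return 0.1 * (goal - latent_state)

def rollout(steps=64):
    manager, worker = Manager(), Worker()
    state = np.zeros(LATENT_DIM)     # stand-in for an encoded observation
    goal = state
    for t in range(steps):
        if t % GOAL_EVERY == 0:      # temporal abstraction: refresh the goal
            goal = manager.select_goal(state)
        action = worker.act(state, goal)
        state = state + action       # stand-in for world-model latent dynamics
    return -np.linalg.norm(state - goal)  # goal-reaching reward shape (assumed)

print(f"final goal-distance reward: {rollout():.3f}")
```

Because the goals live in the world model's latent space, each selected goal can also be decoded back into an image, which is what makes the manager's decisions inspectable in the paper.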