How much do visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Figure 1). This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world compared to raw images. We find that using mid-level perception confers significant advantages over training end-to-end from scratch (i.e. not leveraging priors) in navigation-oriented tasks: agents are able to generalize to situations where the from-scratch approach fails, and training becomes significantly more sample-efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. Therefore, we refine our findings into an efficient max-coverage feature set that can be adopted in lieu of raw images. We perform our study in completely separate buildings for training and testing and compare against visually blind baseline policies and state-of-the-art feature learning methods.
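As a rough sketch of the setup described above (not the authors' implementation), a mid-level perceptual skill can be treated as a frozen function that maps a raw image to a feature observation, which the policy then consumes in place of pixels. All function names and shapes here are hypothetical, and the "skill" is a trivial stand-in rather than a pretrained network:

```python
import numpy as np

def mid_level_features(image: np.ndarray) -> np.ndarray:
    """Hypothetical frozen perceptual skill (e.g. a distance estimator).

    A real skill would be a pretrained network with frozen weights; here
    we just pool and normalize so the sketch runs without trained models.
    """
    small = image[::8, ::8].mean(axis=-1)          # coarse 8x8 spatial pooling
    return (small - small.mean()) / (small.std() + 1e-8)

def policy(features: np.ndarray, n_actions: int = 4, seed: int = 0) -> int:
    """Toy linear policy over the flattened feature observation."""
    rng = np.random.default_rng(seed)              # fixed weights for the sketch
    weights = rng.normal(size=(features.size, n_actions))
    logits = features.ravel() @ weights
    return int(np.argmax(logits))

# The agent observes mid-level features instead of the raw 64x64 RGB frame.
frame = np.random.default_rng(1).random((64, 64, 3))
obs = mid_level_features(frame)
action = policy(obs)
```

The key design point the abstract argues for is that `mid_level_features` is fixed (a prior), so only the policy is trained, shrinking the learning problem compared to mapping raw pixels to actions end-to-end.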