In this work, we address the problem of image-goal navigation in the context of visually realistic 3D environments. This task involves navigating to a location indicated by a target image in a previously unseen environment. Earlier attempts, including RL-based and SLAM-based approaches, have either shown poor generalization performance or are heavily reliant on pose/depth sensors. We present a novel method that leverages a cross-episode memory to learn to navigate. We first train a state-embedding network in a self-supervised fashion, and then use it to embed previously visited states into a memory. To avoid overfitting, we propose to use data augmentation on the RGB input during training. We validate our approach through extensive evaluations, showing that our data-augmented memory-based model establishes a new state of the art on the image-goal navigation task on the challenging Gibson dataset. We obtain this competitive performance from RGB input only, without access to additional sensors such as position or depth.
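The two core components described above, a state-embedding network and a cross-episode memory of embedded states, can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the learned self-supervised encoder is replaced here by a fixed random projection, the augmentation is a simple brightness jitter plus horizontal flip, and all names (`EpisodicMemory`, `augment_rgb`, `embed`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_rgb(img, rng):
    """Toy RGB augmentation (stand-in for the paper's scheme):
    random brightness jitter and a random horizontal flip."""
    img = img * rng.uniform(0.8, 1.2)      # brightness jitter
    if rng.random() < 0.5:
        img = img[:, ::-1, :]              # horizontal flip
    return np.clip(img, 0.0, 1.0)

class EpisodicMemory:
    """Stores L2-normalized state embeddings accumulated across episodes
    and retrieves the stored state most similar to a query embedding."""
    def __init__(self):
        self.keys = []

    def add(self, emb):
        self.keys.append(emb / np.linalg.norm(emb))

    def nearest(self, query):
        q = query / np.linalg.norm(query)
        sims = np.stack(self.keys) @ q     # cosine similarities
        return int(np.argmax(sims)), float(sims.max())

# Stand-in encoder: a fixed random projection of the flattened RGB frame.
# In the paper this would be the self-supervised state-embedding network.
W = rng.normal(size=(32, 8 * 8 * 3))
def embed(img):
    return W @ np.ascontiguousarray(img).reshape(-1)

# Fill the memory with embeddings of augmented observations.
mem = EpisodicMemory()
for _ in range(10):
    obs = rng.uniform(size=(8, 8, 3))
    mem.add(embed(augment_rgb(obs, rng)))

# Query the memory with a goal image's embedding.
goal = rng.uniform(size=(8, 8, 3))
idx, sim = mem.nearest(embed(goal))
```

At navigation time, such a lookup would tell the agent which previously visited state most resembles the goal image, from RGB alone.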