Rethinking the Spatial Route Prior in Vision-and-Language Navigation

Vision-and-language navigation (VLN) is an actively studied task in which an intelligent agent must navigate to a target position by following natural language instructions. This work addresses VLN from a previously ignored aspect, namely the spatial route prior of the navigation scenes. The key innovation is to consider the spatial route prior explicitly under several different VLN settings. In the most information-rich case, where the environment map is known and the shortest-path prior is admitted, we observe that the route between a given origin-destination node pair can be uniquely determined. VLN can therefore be formulated as an ordinary classification problem over all possible destination nodes in the scene. We then relax this formulation to more general VLN settings, proposing a sequential-decision variant (which abandons the shortest-path route prior) and an explore-and-exploit scheme (for the case where the environment map is unknown) that curates a compact and informative sub-graph to exploit. As reported by [34], the performance of VLN methods has been stuck at a plateau for the past two years. Even with increased model complexity, the state-of-the-art success rate on the R2R validation-unseen set has stayed around 62% for single-run and 73% for beam search with model ensembling. We have conducted comprehensive evaluations on both R2R and R4R and, surprisingly, found that utilizing spatial route priors may be the key to breaking the above-mentioned performance ceiling. For example, on the R2R validation-unseen set, when about 40 discrete nodes are explored, our single-model success rate reaches 73%, and it increases to 78% when a Speaker model is ensembled, which significantly outperforms the previous state of the art, VLN-BERT with 3 models ensembled.
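
To make the known-map, shortest-path case concrete, the following is a minimal sketch of how picking a destination node can stand in for planning a whole route. It is an illustration under stated assumptions, not the method as implemented in this work: the function names `shortest_path` and `navigate_by_classification`, the toy graph, and the score dictionary (a stand-in for the output of some instruction-conditioned classifier) are all hypothetical, and Dijkstra's algorithm is simply one common way to recover a shortest route.

```python
import heapq

def shortest_path(graph, origin, destination):
    """Dijkstra over a weighted adjacency dict; returns one shortest route as a node list."""
    dist = {origin: 0.0}
    prev = {}
    heap = [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == destination:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if destination not in dist:
        return None  # destination unreachable from origin
    path = [destination]
    while path[-1] != origin:
        path.append(prev[path[-1]])
    return path[::-1]

def navigate_by_classification(graph, origin, destination_scores):
    """Choose the highest-scoring destination, then recover the route from the map."""
    best = max(destination_scores, key=destination_scores.get)
    return best, shortest_path(graph, origin, best)

# Hypothetical toy navigation graph: node -> {neighbor: edge length}.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.0, "D": 5.0},
    "C": {"A": 4.0, "B": 1.0, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}
# Hypothetical classifier scores over candidate destination nodes.
scores = {"B": 0.1, "C": 0.2, "D": 0.7}

best, route = navigate_by_classification(graph, "A", scores)
print(best, route)
```

Once the destination is fixed, the agent makes no further step-by-step decisions in this sketch, since the map supplies the route. If several shortest routes tie, the function returns just one of them, so the uniqueness claimed in the text is assumed here rather than checked.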
