Interactive robots navigating photo-realistic environments face the challenges underlying vision-and-language navigation (VLN), but in addition they must be trained to handle the dynamic nature of dialogue. However, research in Cooperative Vision-and-Dialog Navigation (CVDN), where a navigator interacts with a guide in natural language in order to reach a goal, treats the dialogue history as a VLN-style static instruction. In this paper, we present VISITRON, a navigator better suited to the interactive regime inherent to CVDN, trained to: i) identify and associate object-level concepts and semantics between the environment and the dialogue history, and ii) decide when to interact versus navigate via imitation learning of a binary classification head. We perform extensive ablations with VISITRON to gain empirical insights and improve performance on CVDN. VISITRON is competitive with models on the static CVDN leaderboard. We also propose a generalized interactive regime for fine-tuning and evaluating VISITRON and future such models with pre-trained guides, to assess adaptability.
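To make point (ii) concrete, the following is a minimal sketch, not the authors' released implementation, of a binary navigate-vs-interact head trained by imitation of oracle decision labels. The state dimension, layer sizes, and optimizer settings are illustrative assumptions; the paper's fused dialogue/vision representation is abstracted here as a single state vector.

```python
# Hypothetical sketch of a binary "navigate or interact" head trained via
# imitation learning. Dimensions and hyperparameters are placeholders.
import torch
import torch.nn as nn

class NavigateOrInteractHead(nn.Module):
    """Binary classifier over the agent's fused dialogue/vision state."""
    def __init__(self, state_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits: [navigate, interact]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.mlp(state)

# One imitation-learning step: cross-entropy against oracle labels that
# indicate whether the agent should have asked the guide at this step.
head = NavigateOrInteractHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

state_batch = torch.randn(8, 768)          # placeholder fused states
oracle_labels = torch.randint(0, 2, (8,))  # 0 = navigate, 1 = interact

logits = head(state_batch)
loss = criterion(logits, oracle_labels)
loss.backward()
optimizer.step()
```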