We study the automatic generation of navigation instructions from 360-degree images captured along indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 971k English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions -- and well above SRs with other generators. Evaluations on RxR's longer, more diverse paths obtain 61-64% SRs in all three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.