FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Vision-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches remain limited in long-range spatial reasoning and often exhibit low success rates and high inference latency, particularly in long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation that supports progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on the HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for the final goal selection. We evaluate FSR-VLN on four comprehensive indoor datasets collected by humanoid robots, using 87 instructions that cover a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance on all four datasets, as measured by retrieval success rate (RSR), while reducing response time by 82% compared with VLM-based methods on tour videos, because slow reasoning is activated only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.

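The fast-to-slow retrieval described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the `Node` class, the cosine-similarity fast matcher, the `margin` rule used to decide that fast matching is ambiguous, and the `slow_refine` callable that stands in for a VLM call. The abstract does not say how FSR-VLN detects that fast intuition has failed, so the margin test here is only one plausible choice.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Sequence


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (room, view, or object)."""
    name: str
    embedding: Sequence[float]
    children: list["Node"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fast_match(query, nodes, k=2):
    """Fast step: rank nodes by similarity and keep the top-k (score, node) pairs."""
    scored = sorted(((cosine(query, n.embedding), n) for n in nodes),
                    key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def select(query, nodes, slow_refine: Callable, k=2, margin=0.1):
    """Return (chosen node, used_slow). The slow step runs only when the
    top two fast scores are closer than `margin` (an assumed failure test)."""
    candidates = fast_match(query, nodes, k)
    if len(candidates) == 1 or candidates[0][0] - candidates[1][0] >= margin:
        return candidates[0][1], False
    return slow_refine([node for _, node in candidates]), True


def retrieve(query, rooms, slow_refine: Callable):
    """Coarse-to-fine descent through the graph: room, then view, then object."""
    path, slow_calls, level = [], 0, rooms
    while level:
        node, used_slow = select(query, level, slow_refine)
        path.append(node.name)
        slow_calls += used_slow
        level = node.children
    return path, slow_calls


# Toy graph with hand-picked 2-D embeddings (hypothetical data).
mug = Node("mug", [1.0, 0.0])
view_a = Node("view_a", [1.0, 0.0])
view_b = Node("view_b", [0.9, 0.1], [mug])
kitchen = Node("kitchen", [1.0, 0.1], [view_a, view_b])
bedroom = Node("bedroom", [0.0, 1.0])

# Stub for the VLM: it simply picks the last candidate it is shown.
path, slow_calls = retrieve([1.0, 0.0], [kitchen, bedroom], lambda c: c[-1])
print(path, slow_calls)
```

In this toy run the room and object levels are resolved by fast matching alone, while the two views score almost identically, so only that level falls through to the slow stub. Restricting the expensive call to ambiguous levels is the mechanism the abstract credits for the reduced response time.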