Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities in exploratory embodied urban navigation driven by implicit needs. CitySeeker comprises 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks further, we investigate a series of exploratory strategies, namely Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required to tackle "last-mile" navigation challenges.