Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather, and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes, and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings with a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison and, critically, produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on the Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results show that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
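To make the dual-similarity retrieval concrete, the sketch below shows one way the fused score could be computed, assuming (i) cosine similarity between learned GAT place embeddings and (ii) a shortest-path kernel over node-labelled scene graphs, combined with a fusion weight `alpha`. The function names, the `alpha` weighting, and the toy graphs are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of a dual-similarity retrieval score (assumed design):
# fused = alpha * cosine(GAT embeddings) + (1 - alpha) * SP-kernel(scene graphs).
import itertools
from collections import Counter

import networkx as nx
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two learned place embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def sp_kernel(g1: nx.Graph, g2: nx.Graph) -> float:
    """Normalised shortest-path kernel on node-labelled graphs.

    Each graph is mapped to a histogram of (label_u, label_v, path_length)
    triples over connected node pairs; the kernel is the normalised dot
    product of the two histograms.
    """
    def sp_features(g: nx.Graph) -> Counter:
        feats = Counter()
        lengths = dict(nx.all_pairs_shortest_path_length(g))
        for u, v in itertools.combinations(g.nodes, 2):
            if v in lengths[u]:
                lu, lv = sorted((g.nodes[u]["label"], g.nodes[v]["label"]))
                feats[(lu, lv, lengths[u][v])] += 1
        return feats

    f1, f2 = sp_features(g1), sp_features(g2)
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm = np.sqrt(sum(v * v for v in f1.values())) * np.sqrt(sum(v * v for v in f2.values()))
    return dot / norm if norm > 0 else 0.0


def dual_similarity(emb_q, emb_r, graph_q, graph_r, alpha: float = 0.5) -> float:
    """Fused score: weighted sum of embedding similarity and structural similarity."""
    return alpha * cosine_similarity(emb_q, emb_r) + (1 - alpha) * sp_kernel(graph_q, graph_r)


# Toy example: two small scene graphs, e.g. "traffic light next to building".
g_query, g_ref = nx.Graph(), nx.Graph()
g_query.add_node(0, label="traffic_light"); g_query.add_node(1, label="building")
g_query.add_edge(0, 1)
g_ref.add_node(0, label="traffic_light"); g_ref.add_node(1, label="building")
g_ref.add_node(2, label="tree"); g_ref.add_edge(0, 1); g_ref.add_edge(1, 2)

rng = np.random.default_rng(0)
emb_query, emb_ref = rng.normal(size=64), rng.normal(size=64)  # stand-ins for GAT outputs
print(f"fused score: {dual_similarity(emb_query, emb_ref, g_query, g_ref):.3f}")
```

In this sketch the embedding term captures learned semantic agreement while the SP-kernel term rewards matching graph topology, which is the role the abstract attributes to the two branches of the dual-similarity mechanism.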