3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities compared to those of modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representations with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
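To make the three-stage pipeline concrete, the sketch below outlines its control flow in Python. Everything here is a hypothetical illustration, not the authors' implementation: the function names (`place_viewpoints`, `ground_with_vlm`, `fuse_predictions`), the viewpoint-placement heuristic, the stubbed VLM call, and the confidence-weighted fusion are all assumptions made for readability.

```python
"""Minimal sketch of a three-stage panoramic grounding pipeline.
All names and heuristics are hypothetical placeholders."""

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class PanoView:
    position: np.ndarray   # 3D viewpoint chosen from scene layout/geometry
    rendering: np.ndarray  # 360-degree panorama with semantic/geometric channels


def place_viewpoints(scene_points: np.ndarray, num_views: int = 4) -> List[np.ndarray]:
    """Stage 1 (hypothetical): pick a compact set of panoramic viewpoints.
    Here we simply spread candidates across the scene's horizontal extent."""
    mins, maxs = scene_points.min(axis=0), scene_points.max(axis=0)
    xs = np.linspace(mins[0], maxs[0], num_views)
    ys = np.linspace(mins[1], maxs[1], num_views)
    height = mins[2] + 1.5  # roughly eye level above the floor
    return [np.array([x, y, height]) for x, y in zip(xs, ys)]


def ground_with_vlm(view: PanoView, query: str) -> Tuple[np.ndarray, float]:
    """Stage 2 (hypothetical): a pretrained 2D VLM grounds the text query on
    one panorama; the result is lifted to a coarse 3D box plus a confidence."""
    # Placeholder: a real system would query the VLM with `view.rendering`.
    box3d = np.concatenate([view.position, np.ones(3)])  # center (3) + size (3)
    return box3d, 0.5


def fuse_predictions(preds: List[Tuple[np.ndarray, float]]) -> np.ndarray:
    """Stage 3 (hypothetical): fuse per-view boxes by confidence-weighted averaging."""
    boxes = np.stack([box for box, _ in preds])
    weights = np.array([conf for _, conf in preds])
    return (boxes * weights[:, None]).sum(axis=0) / weights.sum()


def pano_ground(scene_points: np.ndarray, query: str) -> np.ndarray:
    """Run the full pipeline: place viewpoints, ground per view, fuse."""
    views = [PanoView(p, np.zeros((512, 1024, 3))) for p in place_viewpoints(scene_points)]
    preds = [ground_with_vlm(v, query) for v in views]
    return fuse_predictions(preds)


if __name__ == "__main__":
    scene = np.random.rand(10000, 3) * np.array([6.0, 8.0, 3.0])  # toy point cloud
    print(pano_ground(scene, "the chair next to the window"))
```

The stubbed VLM call and uniform viewpoint grid stand in for the learned components; the point of the sketch is only the data flow from viewpoint placement, through per-view grounding, to fused 3D box prediction.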