While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception, enabling it to directly localize objects in 3D space from textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training of these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.
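The pipeline's central step, lifting a 2D annotation into 3D with estimated depth, can be sketched as follows. This is a minimal illustration under assumed conventions (a pinhole camera model with known intrinsics, a metric depth map, and an axis-aligned 3D box as output); the function name and interface are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def lift_bbox_to_3d(bbox, depth, fx, fy, cx, cy):
    """Back-project the pixels inside a 2D box into camera space and
    return an axis-aligned 3D box as (min_xyz, max_xyz).

    bbox  : (x0, y0, x1, y1) in pixels, x1/y1 exclusive.
    depth : HxW metric depth map (e.g. from a monocular depth estimator).
    fx, fy, cx, cy : pinhole camera intrinsics.
    """
    x0, y0, x1, y1 = bbox
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    z = depth[y0:y1, x0:x1]
    valid = z > 0                       # drop pixels with invalid depth
    x = (us - cx) * z / fx              # back-project to camera coordinates
    y = (vs - cy) * z / fy
    pts = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    return pts.min(axis=0), pts.max(axis=0)
```

In practice such a lift would also need instance masks (so background depth inside the box does not inflate the 3D extent) and robust statistics against depth outliers, but the geometry above is the core of converting 2D grounding labels into 3D ones.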