Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs, a critical skill for robots to operate in the unstructured 3D world. To this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data. Code and data will be available at https://semantic-abstraction.cs.columbia.edu/
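The abstract does not say how a 2D relevancy map is brought into 3D, so the following is only a minimal sketch of one plausible way to do it, not the authors' implementation. It assumes a per-pixel relevancy map (such as one extracted from CLIP), an aligned depth image, and pinhole camera intrinsics. The names `backproject_relevancy`, `voxelize_max`, and all parameters are hypothetical. The point it illustrates is that the resulting 3D grid carries only relevancy scores and geometry, with no class labels, which is what a semantic-agnostic downstream model would consume.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical types for this sketch: a 3D point tagged with a relevancy score.
Point = Tuple[float, float, float, float]  # (x, y, z, relevancy)
VoxelKey = Tuple[int, int, int]


def backproject_relevancy(
    relevancy: List[List[float]],
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift each pixel's relevancy score to a 3D point (pinhole camera model)."""
    points: List[Point] = []
    for v, (rel_row, depth_row) in enumerate(zip(relevancy, depth)):
        for u, (score, z) in enumerate(zip(rel_row, depth_row)):
            if z <= 0.0:  # assumption: non-positive depth means "no measurement"
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, score))
    return points


def voxelize_max(points: List[Point], voxel_size: float) -> Dict[VoxelKey, float]:
    """Aggregate points into a sparse voxel grid, keeping the max relevancy per voxel."""
    grid: Dict[VoxelKey, float] = {}
    for x, y, z, score in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        grid[key] = max(score, grid.get(key, float("-inf")))
    return grid


# Toy 2x2 example; the values are made up for illustration.
toy_relevancy = [[0.1, 0.9], [0.2, 0.0]]
toy_depth = [[1.0, 1.0], [2.0, 0.0]]  # last pixel has no valid depth
toy_points = backproject_relevancy(toy_relevancy, toy_depth, 1.0, 1.0, 0.5, 0.5)
toy_grid = voxelize_max(toy_points, voxel_size=1.0)
```

Keeping the maximum score per voxel is one arbitrary aggregation choice; averaging would be an equally reasonable assumption.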
