Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models by captioning multi-view images of 3D scenes, which allows explicitly associating 3D data with semantic-rich captions. Further, to facilitate coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and their multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only outperforms baseline methods by a large margin of 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ on open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. Code will be available at https://github.com/CVMI-Lab/PLA.
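To make the contrastive objective mentioned above concrete, the following is a minimal sketch (not the authors' released implementation) of a point-caption contrastive loss: pooled 3D features for each caption-associated point set are pulled toward the embedding of their paired caption, produced by a frozen VL text encoder. The function name `point_caption_contrastive_loss`, the tensor shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def point_caption_contrastive_loss(point_feats, text_feats, assignment, temperature=0.07):
    """Sketch of a contrastive loss between 3D features and caption embeddings.

    point_feats: (P, D) pooled 3D features, one per caption-associated point set
    text_feats:  (C, D) caption embeddings from a frozen VL text encoder (e.g. CLIP)
    assignment:  (P,) long tensor, index of the caption paired with each point set
    """
    # Cosine-similarity logits between every point set and every caption
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feats @ text_feats.t() / temperature  # (P, C)
    # Each point set should score highest against its own caption
    return F.cross_entropy(logits, assignment)


# Example with random tensors: 8 point sets, 4 captions, 256-dim features
feats = torch.randn(8, 256)
caps = torch.randn(4, 256)
idx = torch.randint(0, 4, (8,))
loss = point_caption_contrastive_loss(feats, caps, idx)
```

In the hierarchical setting described in the abstract, such a loss would be applied at several levels (e.g. scene, view, and entity point sets), each paired with captions of the corresponding multi-view images.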