We propose CLIP-Fields, an implicit scene model that can be trained with no direct human supervision. The model learns a mapping from spatial locations to semantic embedding vectors, which can then be used for a variety of tasks such as segmentation, instance identification, semantic search over space, and view localization. Most importantly, the mapping can be trained with supervision coming only from web-image- and web-text-trained models such as CLIP, Detic, and Sentence-BERT. Compared to baselines like Mask-RCNN, our method outperforms on few-shot instance identification and semantic segmentation on the HM3D dataset while using only a fraction of the examples. Finally, we show that, using CLIP-Fields as a scene memory, robots can perform semantic navigation in real-world environments. Our code and demonstrations are available at https://mahis.life/clip-fields/.
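To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of an implicit field that maps 3D coordinates to semantic embeddings and is queried by comparing against a text embedding. The MLP architecture, the dimensions, and the `embed_text` stub are illustrative assumptions rather than the paper's actual design; in the real system the text embedding would come from a frozen web-trained model such as CLIP's text encoder or Sentence-BERT.

```python
# Minimal illustrative sketch of an implicit semantic field; all design choices
# here (network size, embedding dimension, embed_text stub) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticField(nn.Module):
    def __init__(self, embed_dim: int = 512, hidden: int = 256):
        super().__init__()
        # Map an (x, y, z) location to a semantic embedding vector.
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # Normalize so a dot product with a unit text embedding is cosine similarity.
        return F.normalize(self.mlp(xyz), dim=-1)

def embed_text(query: str, embed_dim: int = 512) -> torch.Tensor:
    # Placeholder for a frozen language encoder (e.g. CLIP text encoder or
    # Sentence-BERT); here it just returns a random unit vector.
    v = torch.randn(embed_dim)
    return v / v.norm()

# Semantic search over space: score candidate locations against a text query.
field = SemanticField()
points = torch.rand(1000, 3)                      # candidate 3D locations
scores = field(points) @ embed_text("a coffee mug")
best_point = points[scores.argmax()]              # location most similar to the query
```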