We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads, but they are limited to fixed input views and suffer from insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated, unposed multi-view images. Since the framework requires no 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich the semantic embeddings. We also propose an instance-guided contrastive learning scheme to align 2D semantics with the 3D representation. To mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. FLEG efficiently reconstructs language-embedded 3D Gaussian representations in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on a range of related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
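To make the instance-guided contrastive learning concrete, the following is a minimal sketch of one plausible formulation: an InfoNCE-style loss that pools the semantic features rendered from the 3D Gaussians inside each 2D instance mask and aligns them with per-instance 2D embeddings. The function name, the mean-pooling choice, and the temperature are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of an instance-guided contrastive loss (hypothetical names;
# the abstract does not specify the exact formulation).
import torch
import torch.nn.functional as F

def instance_contrastive_loss(rendered_sem, feat_2d, instance_masks, tau=0.07):
    """Align rendered 3D semantic features with 2D instance features.

    rendered_sem:   (C, H, W) semantic feature map splatted from the 3D Gaussians.
    feat_2d:        (N, C) one 2D embedding per instance (e.g., from a 2D backbone).
    instance_masks: (N, H, W) boolean masks for N instances in the view.
    """
    pooled = []
    for mask in instance_masks:
        # Average the rendered features inside each instance mask.
        pooled.append(rendered_sem[:, mask].mean(dim=1))
    pooled = F.normalize(torch.stack(pooled), dim=-1)   # (N, C)
    feat_2d = F.normalize(feat_2d, dim=-1)              # (N, C)

    # InfoNCE: matching 3D/2D instance pairs are positives, all others negatives.
    logits = pooled @ feat_2d.t() / tau                 # (N, N)
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the positives are defined by 2D instance masks rather than 3D labels, a loss of this shape only needs per-frame instance information, which is consistent with the abstract's claim of training on large-scale video data without 3D annotations.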
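The geometry-semantic hierarchical sparsification could likewise look like the two-stage sketch below: a geometric pass that keeps the highest-opacity Gaussians, followed by a semantic pass that drops near-duplicate features. The abstract only names the strategy, so the criteria (opacity ranking, cosine similarity) and thresholds here are assumptions for illustration.

```python
# Minimal sketch of a two-stage geometry-semantic sparsification (hypothetical;
# keep ratios and similarity thresholds are illustrative assumptions).
import torch
import torch.nn.functional as F

def hierarchical_sparsify(means, opacities, sem_feats, keep_geo=0.5, sem_thresh=0.95):
    """Stage 1: keep the geometrically most important Gaussians (by opacity).
    Stage 2: among the kept ones, drop near-duplicate semantic features."""
    # Geometric stage: retain the top-k Gaussians by opacity.
    k = max(1, int(keep_geo * opacities.numel()))
    idx = torch.topk(opacities, k).indices
    means, sem_feats = means[idx], sem_feats[idx]

    # Semantic stage: greedily keep a Gaussian only if its feature is not
    # too similar (cosine) to one that has already been kept.
    feats = F.normalize(sem_feats, dim=-1)
    kept = []
    for i in range(feats.size(0)):
        if not kept or (feats[i] @ feats[torch.tensor(kept)].t()).max() < sem_thresh:
            kept.append(i)
    kept = torch.tensor(kept)
    return means[kept], sem_feats[kept]
```

Pruning on geometry first and semantics second keeps the cheap opacity test in the inner budget, so the quadratic semantic comparison only runs over the already-reduced set, which is where the memory savings for dense-view inputs would come from.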