In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360° scenes from a sparse set of 2D images. Scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large, complex scenes (with a high degree of foreground and background detail) with known camera poses using view-conditioned generative priors, these methods cannot be directly adapted to the pose-free setting, where ground-truth poses are unavailable during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel-view renders and depth maps of a 3D scene. We introduce context and geometry conditioning through Feature-wise Linear Modulation (FiLM) layers as a lightweight alternative to cross-attention, and we propose a novel confidence measure for 3D Gaussian splat representations that enables more reliable detection of these artifacts. By progressively integrating the generated novel views in a Gaussian-SLAM-inspired process, we obtain a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods on complex 360° scenes. Our project page provides additional results, videos, and code.
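To make the FiLM conditioning mentioned above concrete, the following is a minimal PyTorch sketch of a standard FiLM layer (in the sense of Perez et al., 2018): a conditioning vector predicts per-channel scale and shift parameters that modulate feature maps, avoiding the quadratic cost of cross-attention. The class name, dimensions, and the way the conditioning code is produced are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts feature maps
    with per-channel (gamma, beta) predicted from a conditioning vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # A single linear map predicts both gamma and beta for all channels.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature maps; cond: (B, cond_dim) context/geometry code.
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast the per-channel parameters over the spatial dimensions.
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return gamma * x + beta

# Usage sketch: modulate a (2, 64, 32, 32) feature map with a 128-d code.
feats = torch.randn(2, 64, 32, 32)
context = torch.randn(2, 128)   # hypothetical context/geometry embedding
film = FiLMLayer(cond_dim=128, num_channels=64)
out = film(feats, context)      # same shape as feats
```

Compared with cross-attention, FiLM conditions each layer with a single linear projection per conditioning vector, which is why it is attractive as a lightweight alternative when the conditioning signal can be summarized as a global code.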