In this work, we propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it with image datasets and collected de-occlusion datasets that cover far more diverse open-set occlusion patterns. We then propose a unified pose estimation model that integrates global and local mechanisms into both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
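To make the attention design concrete, below is a minimal sketch of how a pose-estimation transformer block might combine global and local mechanisms in both self-attention and cross-attention, as the abstract describes. This is not the authors' implementation: the module layout, token grouping (per-object query tokens, scene-wide features, per-object crop features), and all shapes are illustrative assumptions.

```python
# Hypothetical sketch of a global/local attention block; the grouping of
# tokens by object and the choice of context features are assumptions,
# not details taken from SceneMaker.
import torch
import torch.nn as nn


class GlobalLocalBlock(nn.Module):
    """One transformer block mixing global and local self-/cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.global_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, obj, scene, crops):
        # obj:   (B, N, K, D) K query tokens for each of N objects
        # scene: (B, S, D)    global image features of the full scene
        # crops: (B, N, L, D) features of each object's image crop
        B, N, K, D = obj.shape

        # Global self-attention: all N*K tokens interact, modelling
        # inter-object spatial relations across the whole scene.
        g = obj.reshape(B, N * K, D)
        q = self.norm[0](g)
        g = g + self.global_self(q, q, q)[0]

        # Local self-attention: tokens attend only within their own
        # object, refining each pose hypothesis independently.
        l = g.reshape(B * N, K, D)
        q = self.norm[1](l)
        l = l + self.local_self(q, q, q)[0]

        # Global cross-attention: object tokens read scene-wide context.
        g = l.reshape(B, N * K, D)
        g = g + self.global_cross(self.norm[2](g), scene, scene)[0]

        # Local cross-attention: each object's tokens read only that
        # object's crop features.
        l = g.reshape(B * N, K, D)
        c = crops.reshape(B * N, -1, D)
        l = l + self.local_cross(self.norm[3](l), c, c)[0]

        return l.reshape(B, N, K, D)


if __name__ == "__main__":
    blk = GlobalLocalBlock()
    out = blk(torch.randn(2, 5, 4, 256),   # 5 objects, 4 tokens each
              torch.randn(2, 100, 256),    # scene features
              torch.randn(2, 5, 16, 256))  # per-object crop features
    print(out.shape)  # torch.Size([2, 5, 4, 256])
```

The intent of such a design is that global paths capture scene-level layout constraints between objects, while local paths keep each object's pose grounded in its own appearance; how SceneMaker actually interleaves these paths is specified in the paper, not here.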