3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Text-guided diffusion models have shown superior performance in image/video generation and editing, but they remain little explored in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve \textbf{3D-consistent generation}. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. These results provide 3D priors that serve as conditioning information for the subsequent diffusion process. During denoising diffusion, we further enhance 3D consistency by modeling cross-view correspondences with a novel two-stream asynchronous diffusion process, where the two streams correspond to two different views. Second, we study \textbf{3D local editing} and propose a two-step solution that generates 360$^{\circ}$ manipulated results by editing an object from a single view. In Step 1, we perform 2D local editing by blending the predicted noises. In Step 2, we conduct a noise-to-text inversion process that maps the 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360$^{\circ}$ images can be generated. Last but not least, we extend our model to perform \textbf{one-shot novel view synthesis} by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
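
As a rough illustration of Step 1 (2D local editing by blending the predicted noises), the sketch below mixes two noise predictions with a spatial mask. The abstract does not specify the blending rule, so the per-pixel convex combination, the helper name `blend_predicted_noise`, and the toy 4x4 inputs are all assumptions made for illustration, not the authors' implementation.

```python
import random


def blend_predicted_noise(eps_source, eps_edit, mask):
    """Blend two per-pixel noise predictions with a soft mask in [0, 1].

    Hypothetical helper (not from the paper): where mask = 1 the edited
    prediction is kept, where mask = 0 the source prediction is kept.
    All three inputs are 2D lists of floats with identical shapes.
    """
    if not (len(eps_source) == len(eps_edit) == len(mask)):
        raise ValueError("inputs must have the same number of rows")
    blended = []
    for src_row, edit_row, mask_row in zip(eps_source, eps_edit, mask):
        if not (len(src_row) == len(edit_row) == len(mask_row)):
            raise ValueError("inputs must have the same number of columns")
        # Assumed rule: convex combination of the two predictions per pixel.
        blended.append([
            m * e + (1.0 - m) * s
            for s, e, m in zip(src_row, edit_row, mask_row)
        ])
    return blended


# Toy stand-ins for the noise predicted under the original and edited prompts.
rng = random.Random(0)
H, W = 4, 4
eps_source = [[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]
eps_edit = [[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]

# Edit only the central 2x2 region; everything else keeps the source noise.
mask = [[1.0 if 1 <= r <= 2 and 1 <= c <= 2 else 0.0 for c in range(W)]
        for r in range(H)]

blended = blend_predicted_noise(eps_source, eps_edit, mask)
```

In an actual sampler this blending would presumably be applied at each denoising step, with the result then passed to the noise-to-text inversion of Step 2. That wiring is not shown here because the abstract does not describe it.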
