DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging because it requires massive numbers of training images together with their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model trained on one domain into models for other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models, owing to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models, which can synthesize diverse images per text prompt, without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline fine-tunes the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view consistent images in text-guided target domains without additional data, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations, such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction, to take full advantage of the diversity in text.

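The diversity argument in the abstract can be made concrete with a small numerical toy. The sketch below is not the DATID-3D pipeline and uses no real CLIP or diffusion model; it only illustrates, under simplifying assumptions, why pulling every sample toward one deterministic target shrinks sample diversity, while pulling each sample toward its own randomly drawn target keeps most of it. The feature dimension, the target centre, the `strength` parameter, and the helper names `mean_pairwise_distance` and `adapt` are all hypothetical choices made for illustration.

```python
import math
import random


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.dist(points[i], points[j])
            count += 1
    return total / count


def adapt(samples, targets, strength=0.9):
    """Move each sample a fraction `strength` of the way toward its own target."""
    return [
        tuple(s + strength * (t - s) for s, t in zip(sample, target))
        for sample, target in zip(samples, targets)
    ]


rng = random.Random(0)
dim, n = 8, 200

# Hypothetical "source domain" samples: a spread-out Gaussian cloud.
source = [tuple(rng.gauss(0.0, 1.0) for _ in range(dim)) for _ in range(n)]

# Hypothetical target-domain centre (stands in for what the prompt describes).
centre = tuple(5.0 for _ in range(dim))

# Case A: one deterministic target per prompt, shared by every sample.
shared_targets = [centre] * n

# Case B: a different target per sample, drawn around the same centre
# (stands in for a stochastic generator producing many images per prompt).
diverse_targets = [
    tuple(c + rng.gauss(0.0, 1.0) for c in centre) for _ in range(n)
]

diversity_source = mean_pairwise_distance(source)
diversity_shared = mean_pairwise_distance(adapt(source, shared_targets))
diversity_diverse = mean_pairwise_distance(adapt(source, diverse_targets))

print(f"source:          {diversity_source:.3f}")
print(f"shared target:   {diversity_shared:.3f}")
print(f"diverse targets: {diversity_diverse:.3f}")
```

In case A the shared target cancels out of every pairwise difference, so all distances shrink by exactly the factor 1 - strength. In case B the per-sample randomness of the targets carries over into the adapted samples, so the spread stays close to that of the source. How closely this mirrors a real generator depends on the actual training objective, which this toy does not model.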