eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiffi's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiffi/
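The core idea in the abstract, routing each sampling step to a denoiser specialized for that stage, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Expert` class, the `pick_expert` and `sample` helpers, the noise-level boundaries, and the toy denoisers are hypothetical, and the plain Euler update is a generic stand-in rather than the sampler used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (noisy sample, noise level, prompt) -> denoised estimate.
Denoiser = Callable[[List[float], float, str], List[float]]


@dataclass
class Expert:
    name: str
    sigma_min: float  # inclusive lower bound of the noise range this expert covers
    sigma_max: float  # exclusive upper bound
    denoise: Denoiser


def pick_expert(experts: Sequence[Expert], sigma: float) -> Expert:
    """Return the single expert whose noise interval contains sigma."""
    for expert in experts:
        if expert.sigma_min <= sigma < expert.sigma_max:
            return expert
    raise ValueError(f"no expert covers sigma={sigma}")


def sample(
    experts: Sequence[Expert],
    x: List[float],
    sigmas: Sequence[float],
    prompt: str,
) -> Tuple[List[float], List[str]]:
    """Run a simple Euler sampler, calling exactly one expert per step."""
    used = []
    for sigma, next_sigma in zip(sigmas[:-1], sigmas[1:]):
        expert = pick_expert(experts, sigma)
        denoised = expert.denoise(x, sigma, prompt)
        # Euler step: move x toward the denoised estimate as noise drops.
        x = [d + (xi - d) * (next_sigma / sigma) for xi, d in zip(x, denoised)]
        used.append(expert.name)
    return x, used


# Toy stand-ins for trained networks (assumed, not real models).
experts = [
    Expert("high_noise", 1.0, float("inf"), lambda x, s, p: [0.0 for _ in x]),
    Expert("low_noise", 0.0, 1.0, lambda x, s, p: [1.0 for _ in x]),
]
result, used = sample(experts, [3.0, -2.0], [10.0, 2.0, 0.5, 0.0], "a cat")
```

Because only one expert runs at each step, the per-step cost in this sketch matches that of a single shared model, which is consistent with the abstract's claim that the ensemble keeps inference computation unchanged.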

