We propose to use pretraining to boost general image-to-image translation. Prior image-to-image translation methods usually require dedicated architectural design and train individual translation models from scratch, and they struggle to generate high-quality images of complex scenes, especially when paired training data are not abundant. In this paper, we regard each image-to-image translation problem as a downstream task and introduce a simple and generic framework that adapts a pretrained diffusion model to accommodate various kinds of image-to-image translation. We also propose adversarial training to enhance texture synthesis during diffusion model training, in conjunction with normalized guidance sampling to improve the generation quality. We present extensive empirical comparisons across various tasks on challenging benchmarks such as ADE20K, COCO-Stuff, and DIODE, showing that the proposed pretraining-based image-to-image translation (PITI) is capable of synthesizing images of unprecedented realism and faithfulness.