Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
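To make the gated injection idea concrete, below is a minimal sketch of what such a trainable layer could look like, assuming a PyTorch backbone. The class name `GatedSelfAttention` and the tensor shapes are hypothetical illustrations, not the paper's exact implementation: visual tokens from the frozen model attend jointly with grounding tokens (e.g. encoded boxes and phrases), and the residual update is scaled by a zero-initialized, learnable gate so the pre-trained behavior is preserved at the start of training.

```python
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    """Hypothetical sketch of a gated grounding-injection layer.

    New trainable parameters only; the surrounding diffusion backbone
    is assumed to be frozen elsewhere (requires_grad_(False)).
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: tanh(0) = 0, so before training this layer
        # is an identity and the frozen model's outputs are unchanged.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens: torch.Tensor,
                grounding_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens:    (B, N_v, dim) features from the frozen backbone
        # grounding_tokens: (B, N_g, dim) encoded grounding inputs (e.g. boxes + phrases)
        x = torch.cat([visual_tokens, grounding_tokens], dim=1)
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x)
        # Keep only the visual positions and add the gated residual update.
        update = attn_out[:, : visual_tokens.shape[1], :]
        return visual_tokens + torch.tanh(self.gamma) * update
```

In this sketch, only `GatedSelfAttention` parameters would receive gradients, which mirrors the abstract's design choice of freezing the pre-trained weights while learning to fold in grounding information.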