Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
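To make the gated injection idea concrete, below is a minimal sketch of what such a trainable layer could look like, assuming a PyTorch backbone. The class name `GatedSelfAttention` and the tensor shapes are hypothetical illustrations, not the paper's exact implementation: visual tokens from the frozen model attend jointly with grounding tokens (e.g. encoded boxes and phrases), and the residual update is scaled by a zero-initialized, learnable gate so the pre-trained behavior is preserved at the start of training.

```python
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    """Hypothetical sketch of a gated grounding-injection layer.

    New trainable parameters only; the surrounding diffusion backbone
    is assumed to be frozen elsewhere (requires_grad_(False)).
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: tanh(0) = 0, so before training this layer
        # is an identity and the frozen model's outputs are unchanged.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens: torch.Tensor,
                grounding_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens:    (B, N_v, dim) features from the frozen backbone
        # grounding_tokens: (B, N_g, dim) encoded grounding inputs (e.g. boxes + phrases)
        x = torch.cat([visual_tokens, grounding_tokens], dim=1)
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x)
        # Keep only the visual positions and add the gated residual update.
        update = attn_out[:, : visual_tokens.shape[1], :]
        return visual_tokens + torch.tanh(self.gamma) * update
```

In this sketch, only `GatedSelfAttention` parameters would receive gradients, which mirrors the abstract's design choice of freezing the pre-trained weights while learning to fold in grounding information.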