Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the remarkable ability to learn new visual concepts from a single exemplar. We therefore study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications, such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
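
The abstract names Sparse-Causal Attention without defining it. The sketch below illustrates one common reading of the idea, under stated assumptions: each frame's tokens act as queries, while keys and values come only from the first frame and the immediately preceding frame. The function names (`sparse_causal_context`, `sparse_causal_attention`) are hypothetical, learned query/key/value projections are replaced by identity maps, and plain Python lists stand in for tensors, so this is an illustration of the attention pattern rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_causal_context(frame_index):
    """Indices of the frames that a given frame attends to.

    Assumption: the context is the first frame plus the previous frame
    (the first frame simply attends to itself twice).
    """
    return [0, max(frame_index - 1, 0)]


def sparse_causal_attention(frames):
    """Apply the sparse-causal pattern to a toy video.

    frames: list of frames; each frame is a list of token vectors
    (lists of floats of equal length). Identity projections are used
    in place of learned W_Q, W_K, W_V for brevity (an assumption).
    """
    output = []
    for i, frame in enumerate(frames):
        # Keys/values: tokens of the first frame and of the previous frame.
        context = [tok for j in sparse_causal_context(i) for tok in frames[j]]
        dim = len(frame[0])
        new_frame = []
        for query in frame:
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in context
            ]
            weights = softmax(scores)
            new_frame.append(
                [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
            )
        output.append(new_frame)
    return output


# Toy video: three frames, one 2-dimensional token per frame.
video = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
attended = sparse_causal_attention(video)
```

Because each frame looks at only two frames regardless of video length, the cost of this pattern grows linearly with the number of frames, whereas attending to every frame would grow quadratically.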

