VALOR: 视听语言全方位感知预训练模型与数据集 (VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset) - 专知论文

会员服务 ·

0

模态 · 多模 · MGA · 预训练 · 多模态 ·

2023 年 4 月 17 日

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

翻译：VALOR: 视听语言全方位感知预训练模型与数据集

Sihan Chen,Xingjian He,Longteng Guo,Xinxin Zhu,Weining Wang,Jinhui Tang,Jing Liu

from arxiv, Preprint version w/o audio files embeded in PDF. Audio embeded version can be found on project page or github

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner. It contains three separate encoders for single modality representations, and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR model, including Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio to the same common space, building vision-language, audio-language and audiovisual-language alignment simultaneously. MGC learns how to generate text tokens in conditions of vision, audio or their both. To promote vision-audio-language pretraining research, we construct a large-scale high-quality tri-modality dataset named VALOR-1M, which contains 1M audiable videos with human annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and be generalized to various downstream tasks (e.g., retrieval, captioning and question answering), with different input modalities (e.g., vision-language, audio-language and audiovisual-language). VALOR achieves new state-of-the-art performances on series of public cross-modality benchmarks. Code and data are available at project page https://casia-iva-group.github.io/projects/VALOR.

翻译：本文提出了一种视听语言全方位感知预训练模型（VALOR）用于多模态理解和生成。不同于广泛研究的视觉-语言预训练模型，VALOR以端到端的方式共同建模了视觉、音频和语言之间的关系。它包含三个单模态编码器和一个用于多模态条件文本生成的解码器。我们设计了两个预文本任务来预训练VALOR模型，包括多模态分组对齐（MGA）和多模态分组字幕（MGC）。MGA将视觉、语言和音频投影到相同的公共空间中，同时建立了视觉-语言、音频-语言和音频视觉-语言对齐。MGC学习如何在视觉、音频或两者条件下生成文本标记。为推动视听语言预训练研究，我们构建了一个大规模高质量的三模态数据集VALOR-1M，其中包含100万个经过人工标注的带有音视频字幕的视频。大量实验证明，VALOR可以学习强大的多模态相关性，并能推广到各种下游任务（例如检索、字幕和问答），具有不同的输入模态（例如视觉-语言、音频-语言和音频视觉-语言）。VALOR在一系列公共跨模型基准测试中实现了新的最高性能。代码和数据可在项目页面https://casia-iva-group.github.io/projects/VALOR上获得。

0

相关内容

【CVPR2023】I2MVFormer:大语言模型生成的多视图文档监督零样本图像分类

【CVPR2023】I2MVFormer:大语言模型生成的多视图文档监督零样本图像分类

专知会员服务

21+阅读 · 2023年3月1日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【香港科技大学等】视觉-语言智能:任务、表示学习和大模型，Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

【香港科技大学等】视觉-语言智能:任务、表示学习和大模型，Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

专知会员服务

44+阅读 · 2022年3月8日

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

专知会员服务

29+阅读 · 2022年3月6日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【AAAI2021】知识增强的视觉-语言预训练技术 ERNIE-ViL

【AAAI2021】知识增强的视觉-语言预训练技术 ERNIE-ViL

专知会员服务

26+阅读 · 2021年1月29日

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

321+阅读 · 2020年11月26日

【Google】多模态Transformer视频检索，Multi-modal Transformer

【Google】多模态Transformer视频检索，Multi-modal Transformer

专知会员服务

103+阅读 · 2020年7月22日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

7 Papers & Radios | 谷歌推出DreamBooth扩散模型；张益唐零点猜想论文出炉

7 Papers & Radios | 谷歌推出DreamBooth扩散模型；张益唐零点猜想论文出炉

机器之心

2+阅读 · 2022年11月13日

论文浅尝 | 弱监督下极简的视觉语言预训练模型

论文浅尝 | 弱监督下极简的视觉语言预训练模型

开放知识图谱

1+阅读 · 2022年9月26日

NLP领域最近比较火的Prompt，能否借鉴到多模态领域？一文跟进最新进展

NLP领域最近比较火的Prompt，能否借鉴到多模态领域？一文跟进最新进展

PaperWeekly

17+阅读 · 2022年3月8日

文本+视觉，多篇 Visual/Video BERT 论文介绍

文本+视觉，多篇 Visual/Video BERT 论文介绍

AI科技评论

22+阅读 · 2019年8月30日

GitHub超9千星：一个API调用27个NLP预训练模型

GitHub超9千星：一个API调用27个NLP预训练模型

新智元

17+阅读 · 2019年7月22日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

【论文推荐】最新七篇图像分割相关论文—域适应深度表示学习、循环残差卷积、二值分割、图像合成、无监督跨模态

【论文推荐】最新七篇图像分割相关论文—域适应深度表示学习、循环残差卷积、二值分割、图像合成、无监督跨模态

专知

19+阅读 · 2018年6月1日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

专知

15+阅读 · 2018年5月15日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

面向地理模型集成与运行的数据适配方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于多尺度可变形分层图模型的复杂环境3D感知研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于Ontology的藏文语料库检索关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

金属双环特异介质链中磁等离极化激元的激发与传播研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于视感知的图像视频语义获取关键技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

句子语义的视觉表示研究

国家自然科学基金

4+阅读 · 2009年12月31日

基于中心扩展对齐的汉-英统计机器翻译研究

国家自然科学基金

1+阅读 · 2009年12月31日

TRAIL作为治疗银屑病新的药物作用靶点

国家自然科学基金

0+阅读 · 2008年12月31日

ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Arxiv

0+阅读 · 2023年6月1日

Can Large Pre-trained Models Help Vision Models on Perception Tasks?

Arxiv

0+阅读 · 2023年6月1日

Level Generation Through Large Language Models

Arxiv

0+阅读 · 2023年6月1日

Understanding and Mitigating Copying in Diffusion Models

Arxiv

0+阅读 · 2023年5月31日

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Arxiv

0+阅读 · 2023年5月30日

Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models

Arxiv

0+阅读 · 2023年5月30日

ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation

Arxiv

0+阅读 · 2023年5月30日

Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space

Arxiv

0+阅读 · 2023年5月30日

Unifying Vision-and-Language Tasks via Text Generation

Arxiv

10+阅读 · 2021年2月4日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

VIP会员

文章信息

相关主题

相关VIP内容

【CVPR2023】I2MVFormer:大语言模型生成的多视图文档监督零样本图像分类

【CVPR2023】I2MVFormer:大语言模型生成的多视图文档监督零样本图像分类

专知会员服务

21+阅读 · 2023年3月1日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【香港科技大学等】视觉-语言智能:任务、表示学习和大模型，Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

【香港科技大学等】视觉-语言智能:任务、表示学习和大模型，Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

专知会员服务

44+阅读 · 2022年3月8日

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

专知会员服务

29+阅读 · 2022年3月6日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【AAAI2021】知识增强的视觉-语言预训练技术 ERNIE-ViL

【AAAI2021】知识增强的视觉-语言预训练技术 ERNIE-ViL

专知会员服务

26+阅读 · 2021年1月29日

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

321+阅读 · 2020年11月26日

【Google】多模态Transformer视频检索，Multi-modal Transformer

【Google】多模态Transformer视频检索，Multi-modal Transformer

专知会员服务

103+阅读 · 2020年7月22日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

热门VIP内容

开通专知VIP会员享更多权益服务

【伯克利博士论文】通过真实世界实践赋能机器人自主性

军用无人机集群技术尚未成熟——但潜力可期

人工智能安全治理白皮书（2025）

AgentOps综述：分类、挑战与未来方向

相关资讯

7 Papers & Radios | 谷歌推出DreamBooth扩散模型；张益唐零点猜想论文出炉

7 Papers & Radios | 谷歌推出DreamBooth扩散模型；张益唐零点猜想论文出炉

机器之心

2+阅读 · 2022年11月13日

论文浅尝 | 弱监督下极简的视觉语言预训练模型

论文浅尝 | 弱监督下极简的视觉语言预训练模型

开放知识图谱

1+阅读 · 2022年9月26日

NLP领域最近比较火的Prompt，能否借鉴到多模态领域？一文跟进最新进展

NLP领域最近比较火的Prompt，能否借鉴到多模态领域？一文跟进最新进展

PaperWeekly

17+阅读 · 2022年3月8日

文本+视觉，多篇 Visual/Video BERT 论文介绍

文本+视觉，多篇 Visual/Video BERT 论文介绍

AI科技评论

22+阅读 · 2019年8月30日

GitHub超9千星：一个API调用27个NLP预训练模型

GitHub超9千星：一个API调用27个NLP预训练模型

新智元

17+阅读 · 2019年7月22日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

【论文推荐】最新七篇图像分割相关论文—域适应深度表示学习、循环残差卷积、二值分割、图像合成、无监督跨模态

【论文推荐】最新七篇图像分割相关论文—域适应深度表示学习、循环残差卷积、二值分割、图像合成、无监督跨模态

专知

19+阅读 · 2018年6月1日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

专知

15+阅读 · 2018年5月15日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

相关论文

ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Arxiv

0+阅读 · 2023年6月1日

Can Large Pre-trained Models Help Vision Models on Perception Tasks?

Arxiv

0+阅读 · 2023年6月1日

Level Generation Through Large Language Models

Arxiv

0+阅读 · 2023年6月1日

Understanding and Mitigating Copying in Diffusion Models

Arxiv

0+阅读 · 2023年5月31日

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Arxiv

0+阅读 · 2023年5月30日

Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models

Arxiv

0+阅读 · 2023年5月30日

ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation

Arxiv

0+阅读 · 2023年5月30日

Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space

Arxiv

0+阅读 · 2023年5月30日

Unifying Vision-and-Language Tasks via Text Generation

Arxiv

10+阅读 · 2021年2月4日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

相关基金

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

面向地理模型集成与运行的数据适配方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于多尺度可变形分层图模型的复杂环境3D感知研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于Ontology的藏文语料库检索关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

金属双环特异介质链中磁等离极化激元的激发与传播研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于视感知的图像视频语义获取关键技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

句子语义的视觉表示研究

国家自然科学基金

4+阅读 · 2009年12月31日

基于中心扩展对齐的汉-英统计机器翻译研究

国家自然科学基金

1+阅读 · 2009年12月31日

TRAIL作为治疗银屑病新的药物作用靶点

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员