Zorro: the masked multimodal transformer


January 23, 2023


Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
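The routing idea in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the token layout (audio tokens, then video tokens, then a group of fusion tokens), the function names `build_routing_mask` and `masked_attention`, and the single-head NumPy attention are all assumptions made for illustration. The sketch shows one way a mask could keep audio and video streams modality-pure while a separate group of tokens sees both.

```python
import numpy as np


def build_routing_mask(n_audio: int, n_video: int, n_fusion: int) -> np.ndarray:
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Assumed (hypothetical) token order: audio, then video, then fusion.
    """
    n = n_audio + n_video + n_fusion
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[audio, audio] = True  # audio queries see only audio keys
    mask[video, video] = True  # video queries see only video keys
    mask[fusion, :] = True     # fusion queries see every token
    return mask


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with a boolean routing mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    mask = build_routing_mask(n_audio=2, n_video=3, n_fusion=1)
    print(mask.astype(int))
```

Under these assumptions, perturbing the video tokens leaves the audio outputs unchanged, which is the property that would allow unimodal inference and independent audio and visual features for a contrastive objective; only the fusion outputs depend on both modalities.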

