VILT:没有革命或地区监督的愿景和语言变异器 (ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision) - 专知论文

会员服务 ·

0

变换 · 卷积 · Processing（编程语言） · INTERACT · Performer ·

2021 年 2 月 5 日

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

翻译：VILT:没有革命或地区监督的愿景和语言变异器

Wonjae Kim,Bokyung Son,Ildoo Kim

from arxiv, 11 pages, 6 figures, 5 tables

Vision-and-Language Pretraining (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches for VLP heavily rely on image feature extraction processes, most of which involve region supervisions (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the actual multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual encoder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to 60 times faster than previous VLP models, yet with competitive or better downstream task performance.

翻译：视觉和语言预科培训(VLP)在各种共同愿景和语言下游任务上的业绩有所改进。目前VLP的方法主要依赖图像特征提取过程,其中多数涉及区域监督(如物体探测)和革命结构(如ResNet ) 。尽管在文献中被忽视,但我们发现在以下两个方面都存在问题:(1) 效率/速度,即仅仅提取输入特征所需要的计算量远远多于实际的多式联运互动步骤;(2) 表达力,因为它与视觉编码器及其预先定义的视觉词汇的表达力高度相连。在本文件中,我们提出了一个最小的VLP模型、视觉和语言变形器(VILT ), 单立语, 意思是视觉投入的处理被大大简化到与我们处理文字输入的相同的无革命性方式。我们显示VILT比以往的VLP模型快60倍,但具有竞争性或更高的下游任务性。

3

相关内容

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

专知会员服务

19+阅读 · 2020年6月23日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知会员服务

23+阅读 · 2020年3月31日

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

专知会员服务

30+阅读 · 2020年3月11日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

35+阅读 · 2020年3月3日

【2020新书】深度学习视觉系统，Deep Learning for Vision Systems, 396页pdf

【2020新书】深度学习视觉系统，Deep Learning for Vision Systems, 396页pdf

专知会员服务

166+阅读 · 2020年2月18日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

116+阅读 · 2020年2月3日

【ICLR2020】理解非自回归机器翻译中的知识蒸馏（Understanding Knowledge Distillation in Non-autoregressive Machine Translation）

【ICLR2020】理解非自回归机器翻译中的知识蒸馏（Understanding Knowledge Distillation in Non-autoregressive Machine Translation）

专知会员服务

10+阅读 · 2019年12月28日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

45+阅读 · 2019年10月17日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

64+阅读 · 2019年10月9日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

17+阅读 · 2019年10月9日

【论文笔记】通俗理解少样本文本分类 (Few-Shot Text Classification) (1)

【论文笔记】通俗理解少样本文本分类 (Few-Shot Text Classification) (1)

深度学习自然语言处理

7+阅读 · 2020年4月8日

VALSE Webinar 19-10期 Connecting Vision and Language to Action

VALSE Webinar 19-10期 Connecting Vision and Language to Action

VALSE

3+阅读 · 2019年4月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

41+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

16+阅读 · 2018年12月24日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

19+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

8+阅读 · 2017年11月25日

【音乐】Attention

【音乐】Attention

英语演讲视频每日一推

3+阅读 · 2017年8月22日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

Diagnosing Vision-and-Language Navigation: What Really Matters

Arxiv

0+阅读 · 2021年3月30日

Rethinking Spatial Dimensions of Vision Transformers

Arxiv

0+阅读 · 2021年3月30日

CvT: Introducing Convolutions to Vision Transformers

Arxiv

2+阅读 · 2021年3月29日

ViViT: A Video Vision Transformer

Arxiv

0+阅读 · 2021年3月29日

Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

Arxiv

0+阅读 · 2021年3月27日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

Doubly Attentive Transformer Machine Translation

Doubly Attentive Transformer Machine Translation

Arxiv

4+阅读 · 2018年7月30日

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Arxiv

5+阅读 · 2018年7月24日

Iterative Visual Reasoning Beyond Convolutions

Arxiv

3+阅读 · 2018年3月29日

Language Modeling with Gated Convolutional Networks

Arxiv

5+阅读 · 2017年9月8日

VIP会员

文章信息

相关主题

Processing（编程语言）

相关VIP内容

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

【伯克利】自回归模型的局部掩卷积，Locally Masked Convolution for Autoregressive Models

专知会员服务

19+阅读 · 2020年6月23日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知会员服务

23+阅读 · 2020年3月31日

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

专知会员服务

30+阅读 · 2020年3月11日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

35+阅读 · 2020年3月3日

【2020新书】深度学习视觉系统，Deep Learning for Vision Systems, 396页pdf

【2020新书】深度学习视觉系统，Deep Learning for Vision Systems, 396页pdf

专知会员服务

166+阅读 · 2020年2月18日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

116+阅读 · 2020年2月3日

【ICLR2020】理解非自回归机器翻译中的知识蒸馏（Understanding Knowledge Distillation in Non-autoregressive Machine Translation）

【ICLR2020】理解非自回归机器翻译中的知识蒸馏（Understanding Knowledge Distillation in Non-autoregressive Machine Translation）

专知会员服务

10+阅读 · 2019年12月28日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

45+阅读 · 2019年10月17日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

64+阅读 · 2019年10月9日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

17+阅读 · 2019年10月9日

热门VIP内容

相关资讯

【论文笔记】通俗理解少样本文本分类 (Few-Shot Text Classification) (1)

【论文笔记】通俗理解少样本文本分类 (Few-Shot Text Classification) (1)

深度学习自然语言处理

7+阅读 · 2020年4月8日

VALSE Webinar 19-10期 Connecting Vision and Language to Action

VALSE Webinar 19-10期 Connecting Vision and Language to Action

VALSE

3+阅读 · 2019年4月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

41+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

16+阅读 · 2018年12月24日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

19+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

8+阅读 · 2017年11月25日

【音乐】Attention

【音乐】Attention

英语演讲视频每日一推

3+阅读 · 2017年8月22日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

相关论文

Diagnosing Vision-and-Language Navigation: What Really Matters

Arxiv

0+阅读 · 2021年3月30日

Rethinking Spatial Dimensions of Vision Transformers

Arxiv

0+阅读 · 2021年3月30日

CvT: Introducing Convolutions to Vision Transformers

Arxiv

2+阅读 · 2021年3月29日

ViViT: A Video Vision Transformer

Arxiv

0+阅读 · 2021年3月29日

Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

Arxiv

0+阅读 · 2021年3月27日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

Doubly Attentive Transformer Machine Translation

Doubly Attentive Transformer Machine Translation

Arxiv

4+阅读 · 2018年7月30日

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Arxiv

5+阅读 · 2018年7月24日

Iterative Visual Reasoning Beyond Convolutions

Arxiv

3+阅读 · 2018年3月29日

Language Modeling with Gated Convolutional Networks

Arxiv

5+阅读 · 2017年9月8日

微信扫码咨询专知VIP会员