UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling - Zhuanzhi Papers


July 27, 2022

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling


Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

From arXiv; ECCV 2022 (Oral Presentation)

We propose UniTAB, which Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align the predicted words with object regions. To achieve this, a model must generate the desired text and box outputs together while also indicating the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence and introduces a special <obj> token to naturally indicate word-box alignments in the sequence. UniTAB can thus provide a more comprehensive and interpretable image description by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head and significantly outperforms the state of the art in both grounding and captioning evaluations. On general VL tasks that have different desired output formats (i.e., text, box, or their combination), UniTAB with a single network achieves performance better than or comparable to the task-specific state of the art. Experiments cover 7 VL benchmarks, including grounded captioning, visual grounding, image captioning, and visual question answering. Furthermore, UniTAB's unified multi-task network and task-agnostic output sequence design make the model parameter-efficient and generalizable to new tasks.
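To make the shared token sequence concrete, here is a minimal Python sketch of how words and boxes could be serialized into one sequence and parsed back. It is an illustration under stated assumptions, not the authors' implementation: the abstract only names the `<obj>` token, so the closing `</obj>` marker, the `<bin_k>` coordinate tokens, the `NUM_BINS` resolution, and the `Span`, `encode`, and `decode` names are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# All constants below are assumptions for illustration; the abstract only
# states that a special <obj> token marks word-box alignments.
NUM_BINS = 1000                          # assumed coordinate quantization resolution
OBJ_OPEN, OBJ_CLOSE = "<obj>", "</obj>"  # the closing marker is hypothetical

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class Span:
    """A run of words, optionally grounded to one box."""
    words: List[str]
    box: Optional[Box] = None


def quantize(value: float) -> int:
    """Map a normalized coordinate to a discrete bin index."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * NUM_BINS), NUM_BINS - 1)


def encode(spans: List[Span]) -> List[str]:
    """Serialize words and boxes into a single token sequence."""
    tokens: List[str] = []
    for span in spans:
        if span.box is None:
            tokens.extend(span.words)
            continue
        tokens.append(OBJ_OPEN)
        tokens.extend(span.words)
        tokens.extend(f"<bin_{quantize(v)}>" for v in span.box)
        tokens.append(OBJ_CLOSE)
    return tokens


def decode(tokens: List[str]) -> List[Span]:
    """Recover words and their aligned boxes from a token sequence."""
    spans: List[Span] = []
    words: List[str] = []
    bins: List[int] = []
    grounded = False
    for tok in tokens:
        if tok == OBJ_OPEN:
            grounded, words, bins = True, [], []
        elif tok == OBJ_CLOSE and grounded:
            # Use bin centers as the de-quantized coordinates.
            box = tuple((b + 0.5) / NUM_BINS for b in bins) if len(bins) == 4 else None
            spans.append(Span(words, box))
            grounded = False
        elif grounded and tok.startswith("<bin_") and tok.endswith(">"):
            bins.append(int(tok[5:-1]))
        elif grounded:
            words.append(tok)
        elif spans and spans[-1].box is None:
            spans[-1].words.append(tok)  # extend the current ungrounded run
        else:
            spans.append(Span([tok]))
    return spans


caption = [Span(["a"]), Span(["dog"], (0.125, 0.25, 0.5, 0.75)), Span(["runs"])]
sequence = encode(caption)
recovered = decode(sequence)
```

In this sketch the alignment is carried by position alone: whatever words sit between the opening and closing markers are grounded to the four coordinate tokens that follow them, so a single output head over one vocabulary can emit text, boxes, or both.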

