Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a {\bf M}ultimodal {\bf I}ncremental {\bf T}ransformer with {\bf V}isual {\bf G}rounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.


翻译:视觉对话是一项具有挑战性的任务,因为它需要在理解视觉环境的基础上回答一系列连贯的问题。 以前的研究侧重于隐含探索多式共同参照,暗含关注空间图像特征或目标级图像特征,但忽视了将物体明确定位在视觉内容中的重要性,而视觉内容与文本内容中的实体相关。 因此,在本文中,我们建议使用一个 {bf M}ultimodal {bf I}ntremental {bremental {bf Transxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

0
下载
关闭预览

相关内容

专知会员服务
50+阅读 · 2021年8月8日
最新《Transformers模型》教程,64页ppt
专知会员服务
278+阅读 · 2020年11月26日
因果图,Causal Graphs,52页ppt
专知会员服务
238+阅读 · 2020年4月19日
Hierarchically Structured Meta-learning
CreateAMind
23+阅读 · 2019年5月22日
Transferring Knowledge across Learning Processes
CreateAMind
25+阅读 · 2019年5月18日
Unsupervised Learning via Meta-Learning
CreateAMind
41+阅读 · 2019年1月3日
MoCoGAN 分解运动和内容的视频生成
CreateAMind
18+阅读 · 2017年10月21日
Arxiv
19+阅读 · 2020年12月23日
Exploring Visual Relationship for Image Captioning
Arxiv
14+阅读 · 2018年9月19日
VIP会员
相关VIP内容
专知会员服务
50+阅读 · 2021年8月8日
最新《Transformers模型》教程,64页ppt
专知会员服务
278+阅读 · 2020年11月26日
因果图,Causal Graphs,52页ppt
专知会员服务
238+阅读 · 2020年4月19日
Top
微信扫码咨询专知VIP会员