A Synchronized Multi-Modal Attention-Caption Dataset and Analysis

In this work, we present a novel multi-modal dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences between human attention in free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also compare human and machine attention in captioning tasks, in particular the top-down soft attention approach that is argued to mimic human attention. Our study reveals that (1) human attention behaviour in free-viewing differs from that in image description, as humans tend to fixate on a greater variety of regions under the latter task; (2) there is a strong relationship between the described objects and the objects attended by subjects ($97\%$ of the described objects are attended); (3) a convolutional neural network used as a feature encoder captures the regions that humans attend to under image captioning to a great extent (around $78\%$); (4) soft attention as the top-down mechanism does not agree with human attention behaviour either spatially or temporally; and (5) soft attention does not add strong, beneficial human-like attention behaviour for the captioning task, as the correlation between caption scores and attention consistency scores is low, indicating a large gap between humans and machines with regard to top-down attention.
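The abstract does not say how an object is judged to be "attended", so the following is only a minimal sketch of how a figure such as finding (2) could be computed. It assumes that each object has an axis-aligned bounding box and that an object counts as attended when at least one fixation lands inside its box. The names `Box`, `Point`, `is_attended`, and `described_attended_ratio`, along with the toy data, are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed formats (not specified in the abstract):
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]              # fixation location (x, y) in pixels


def is_attended(box: Box, fixations: Iterable[Point]) -> bool:
    """Return True if at least one fixation falls inside the box."""
    x0, y0, x1, y1 = box
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in fixations)


def described_attended_ratio(
    boxes: Dict[str, Box],
    described: List[str],
    fixations: List[Point],
) -> float:
    """Fraction of described objects that received at least one fixation."""
    # Only objects that have an annotated box can be checked.
    known = [name for name in described if name in boxes]
    if not known:
        return 0.0
    hits = sum(is_attended(boxes[name], fixations) for name in known)
    return hits / len(known)


# Toy example: two described objects, only one of which was fixated.
boxes = {
    "dog": (10, 10, 50, 50),
    "ball": (60, 60, 80, 80),
    "tree": (100, 0, 150, 90),
}
described = ["dog", "ball"]
fixations = [(20.0, 30.0), (120.0, 40.0)]
ratio = described_attended_ratio(boxes, described, fixations)
```

In this toy case the ratio is 0.5, because the dog is fixated and the ball is not. A real analysis would also need to decide how to handle fixation noise, overlapping objects, and objects mentioned more than once, none of which are covered here.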


Related content

Attention mechanism

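The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in a machine translation task; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of progress in attention research.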
