In this work, we present a novel multi-modal dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences between human attention in free-viewing and image captioning tasks. We examine the relationship between human attention and language constructs during perception and sentence articulation. We also compare human and machine attention in captioning tasks, in particular the top-down soft-attention approach that is argued to mimic human attention. Our study reveals that (1) human attention behaviour in free-viewing differs from that in image description, as humans tend to fixate on a greater variety of regions under the latter task; (2) there is a strong relationship between the described objects and the objects attended by subjects ($97\%$ of described objects are attended); (3) a convolutional neural network used as a feature encoder captures, to a great extent (around $78\%$), the regions that humans attend to under image captioning; (4) soft attention as the top-down mechanism agrees with human attention behaviour neither spatially nor temporally; and (5) soft attention does not add strongly beneficial human-like attention behaviour for the task of captioning, as the correlation between caption scores and attention consistency scores is low, indicating a large gap between humans and machines with regard to top-down attention.