Problems at the intersection of language and vision, such as visual question answering, have recently attracted considerable attention in multi-modal machine learning as computer vision research moves beyond traditional recognition tasks. Deep neural network models that use the linguistic structure of a question to dynamically instantiate network layouts have recently proven successful at visual question answering. However, converting the question to a network layout simplifies it, causing the model to lose information. In this paper, we enrich the image information with textual data, using image captions and external knowledge bases, to generate more coherent answers. We achieve 57.1% overall accuracy on the test-dev open-ended questions of the visual question answering (VQA 1.0) real-image dataset.