We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in visual question answering. Existing large-scale visual datasets with annotations such as image class labels, bounding boxes, and region descriptions are good sources for learning rich and diverse visual concepts. However, it is not straightforward to capture these visual concepts and transfer them to visual question answering models, because there is no direct link between question-dependent answering models and visual data that carries no question or task specification. We tackle this problem in two steps: 1) learning a task-conditional visual classifier based on unsupervised task discovery, and 2) transferring and adapting the task-conditional visual classifier to visual question answering models. Specifically, we employ linguistic knowledge sources such as a structured lexical database (e.g., WordNet) and visual descriptions for unsupervised task discovery, and adapt the learned task-conditional visual classifier to the answering unit of a visual question answering model. We empirically show that the proposed algorithm successfully generalizes to unseen answers using the knowledge transferred from the visual data.
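To make the two-step approach concrete, below is a minimal sketch of a task-conditional visual classifier in PyTorch. Everything here is an illustrative assumption rather than the authors' implementation: the class name `TaskConditionalClassifier`, the gating scheme, and all dimensions are hypothetical. The idea it shows is that an embedding of a discovered task (e.g., drawn from a WordNet-derived task vocabulary) conditions pooled visual features before a shared answer classifier, whose weights could later be transferred to the answering unit of a VQA model.

```python
# Minimal sketch of a task-conditional visual classifier.
# Names, dimensions, and the gating mechanism are assumptions for
# illustration, not the paper's actual architecture.
import torch
import torch.nn as nn

class TaskConditionalClassifier(nn.Module):
    def __init__(self, num_tasks, num_answers, feat_dim=2048, task_dim=300):
        super().__init__()
        # Embedding for discovered tasks (e.g., from a WordNet-based vocabulary).
        self.task_embed = nn.Embedding(num_tasks, task_dim)
        # The task embedding gates the visual features element-wise.
        self.gate = nn.Sequential(nn.Linear(task_dim, feat_dim), nn.Sigmoid())
        # Shared answer classifier over task-conditioned features; these
        # weights are the part one would transfer to a VQA answering unit.
        self.classifier = nn.Linear(feat_dim, num_answers)

    def forward(self, visual_feat, task_id):
        # visual_feat: (batch, feat_dim) pooled image/region features
        # task_id:     (batch,) integer id of the discovered task
        g = self.gate(self.task_embed(task_id))  # (batch, feat_dim)
        return self.classifier(visual_feat * g)  # (batch, num_answers) logits

# Toy usage: pretrain on visual data labeled with discovered tasks, then
# reuse the classifier weights inside a question-answering model.
model = TaskConditionalClassifier(num_tasks=50, num_answers=3000)
feats = torch.randn(4, 2048)
tasks = torch.randint(0, 50, (4,))
logits = model(feats, tasks)  # shape: (4, 3000)
```

The gating design is one simple way to realize "task-conditional"; any conditioning mechanism that lets visual data without questions be organized by task would serve the same role in this sketch.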