视觉问答(Visual Question Answering,VQA),是一种涉及计算机视觉和自然语言处理的学习任务。这一任务的定义如下: A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output[1]。 翻译为中文:一个VQA系统以一张图片和一个关于这张图片形式自由、开放式的自然语言问题作为输入,以生成一条自然语言答案作为输出。简单来说,VQA就是给定的图片进行问答。

视觉问答(Visual Question Answering,VQA)专知荟萃

入门学习

  1. 基于深度学习的VQA(视觉问答)技术
  2. 视觉问答全景概述:从数据集到技术方法
  3. 论文读书笔记(Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding)
  4. 能看图回答问题的AI离我们还有多远?Facebook向视觉对话进发
  5. 图像问答Image Question Answering
  6. 实战深度学习之图像问答
  7. 2017 VQA Challenge 第一名技术报告
  8. 深度学习为视觉和语言之间搭建了一座桥梁

进阶论文

  1. Kushal Kafle, and Christopher Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding [2017].
  2. Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick, CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017.
  3. Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick, Inferring and Executing Programs for Visual Reasoning, arXiv:1705.03633, 2017. [https://arxiv.org/abs/1705.03633]
  4. Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko, Learning to Reason: End-to-End Module Networks for Visual Question Answering, arXiv:1704.05526, 2017. [https://arxiv.org/abs/1704.05526]
  5. Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap, A simple neural network module for relational reasoning, arXiv:1706.01427, 2017. [https://arxiv.org/abs/1706.01427]
  6. Hedi Ben-younes, Remi Cadene, Matthieu Cord, Nicolas Thome: MUTAN: Multimodal Tucker Fusion for Visual Question Answering [https://arxiv.org/pdf/1705.06676.pdf] [https://github.com/Cadene/vqa.pytorch]
  7. Vahid Kazemi, Ali Elqursh, Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering, arXiv:1704.03162, 2016. [https://arxiv.org/abs/1704.03162] [https://github.com/Cyanogenoid/pytorch-vqa]
  8. Kushal Kafle, and Christopher Kanan. An Analysis of Visual Question Answering Algorithms. arXiv:1703.09684, 2017. [https://arxiv.org/abs/1703.09684]
  9. Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim, Dual Attention Networks for Multimodal Reasoning and Matching, arXiv:1611.00471, 2016. [https://arxiv.org/abs/1611.00471]
  10. Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325, 2016. [https://arxiv.org/abs/1610.04325]
  11. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach, Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, arXiv:1606.01847, 2016. [https://arxiv.org/abs/1606.01847] [https://github.com/akirafukui/vqa-mcb]
  12. Kuniaki Saito, Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada, DualNet: Domain-Invariant Network for Visual Question Answering. arXiv:1606.06108v1, 2016. [https://arxiv.org/pdf/1606.06108.pdf]
  13. Arijit Ray, Gordon Christie, Mohit Bansal, Dhruv Batra, Devi Parikh, Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions, arXiv:1606.06622, 2016. [https://arxiv.org/pdf/1606.06622v1.pdf]
  14. Hyeonwoo Noh, Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647, 2016. [http://arxiv.org/abs/1606.03647v1]
  15. Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, arXiv:1606.00061, 2016. [https://arxiv.org/pdf/1606.00061v2.pdf] [https://github.com/jiasenlu/HieCoAttenVQA]
  16. Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Multimodal Residual Learning for Visual QA, arXiv:1606.01455, 2016. [https://arxiv.org/pdf/1606.01455v1.pdf]
  17. Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, FVQA: Fact-based Visual Question Answering, arXiv:1606.05433, 2016. [https://arxiv.org/pdf/1606.05433.pdf]
  18. Ilija Ilievski, Shuicheng Yan, Jiashi Feng, A Focused Dynamic Attention Model for Visual Question Answering, arXiv:1604.01485. [https://arxiv.org/pdf/1604.01485v1.pdf]
  19. Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei, Visual7W: Grounded Question Answering in Images, CVPR 2016. [http://arxiv.org/abs/1511.03416]
  20. Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han, Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, CVPR, 2016.[http://arxiv.org/pdf/1511.05756.pdf]
  21. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Learning to Compose Neural Networks for Question Answering, NAACL 2016. [http://arxiv.org/pdf/1601.01705.pdf]
  22. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Deep compositional question answering with neural module networks, CVPR 2016. [https://arxiv.org/abs/1511.02799]
  23. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, Stacked Attention Networks for Image Question Answering, CVPR 2016. [http://arxiv.org/abs/1511.02274] [https://github.com/JamesChuanggg/san-torch]
  24. Kevin J. Shih, Saurabh Singh, Derek Hoiem, Where To Look: Focus Regions for Visual Question Answering, CVPR, 2015. [http://arxiv.org/pdf/1511.07394v2.pdf]
  25. Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering, arXiv:1511.05960v1, Nov 2015. [http://arxiv.org/pdf/1511.05960v1.pdf]
  26. Huijuan Xu, Kate Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, arXiv:1511.05234v1, Nov 2015. [http://arxiv.org/abs/1511.05234]
  27. Kushal Kafle and Christopher Kanan, Answer-Type Prediction for Visual Question Answering, CVPR 2016. [http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Kafle_Answer-Type_Prediction_for_CVPR_2016_paper.html]
  28. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, ICCV, 2015. [http://arxiv.org/pdf/1505.00468]
  29. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, ICCV, 2015. [http://arxiv.org/pdf/1505.00468] [https://github.com/JamesChuanggg/VQA-tensorflow]
  30. Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus, Simple Baseline for Visual Question Answering, arXiv:1512.02167v2, Dec 2015. [http://arxiv.org/abs/1512.02167]
  31. Hauyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, NIPS 2015. [http://arxiv.org/pdf/1505.05612.pdf]
  32. Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, ICCV 2015. [http://arxiv.org/pdf/1505.01121v3.pdf]
  33. Mengye Ren, Ryan Kiros, Richard Zemel, Exploring Models and Data for Image Question Answering, ICML 2015. [http://arxiv.org/pdf/1505.02074.pdf]
  34. Mateusz Malinowski, Mario Fritz, Towards a Visual Turing Challe, NIPS Workshop 2015. [http://arxiv.org/abs/1410.8027]
  35. Mateusz Malinowski, Mario Fritz, A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, NIPS 2014. [http://arxiv.org/pdf/1410.0210v4.pdf]

Attention-Based

  1. Hedi Ben-younes, Remi Cadene, Matthieu Cord, Nicolas Thome: MUTAN: Multimodal Tucker Fusion for Visual Question Answering [https://arxiv.org/pdf/1705.06676.pdf] [https://github.com/Cadene/vqa.pytorch]
  2. Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325, 2016. [https://arxiv.org/abs/1610.04325]
  3. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach, Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, arXiv:1606.01847, 2016. [https://arxiv.org/abs/1606.01847]
  4. Hyeonwoo Noh, Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647, 2016. [http://arxiv.org/abs/1606.03647v1]
  5. Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, arXiv:1606.00061, 2016. [https://arxiv.org/pdf/1606.00061v2.pdf]
  6. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, Stacked Attention Networks for Image Question Answering, CVPR 2016. [http://arxiv.org/abs/1511.02274]
  7. Ilija Ilievski, Shuicheng Yan, Jiashi Feng, A Focused Dynamic Attention Model for Visual Question Answering, arXiv:1604.01485. [https://arxiv.org/pdf/1604.01485v1.pdf]
  8. Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering, arXiv:1511.05960v1, Nov 2015. [http://arxiv.org/pdf/1511.05960v1.pdf]
  9. Huijuan Xu, Kate Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, arXiv:1511.05234v1, Nov 2015. [http://arxiv.org/abs/1511.05234]

Knowledge-based

  1. Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, FVQA: Fact-based Visual Question Answering, arXiv:1606.05433, 2016. [https://arxiv.org/pdf/1606.05433.pdf]
  2. Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, Anthony Dick, Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources, CVPR 2016. [http://arxiv.org/abs/1511.06973]
  3. Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, Explicit Knowledge-based Reasoning for Visual Question Answering, arXiv:1511.02570v2, Nov 2015. [http://arxiv.org/abs/1511.02570]
  4. Yuke Zhu, Ce Zhang, Christopher Re,́ Li Fei-Fei, Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv:1507.05670, Nov 2015. [http://arxiv.org/abs/1507.05670]

Memory Network

  1. Caiming Xiong, Stephen Merity, Richard Socher, Dynamic Memory Networks for Visual and Textual Question Answering, ICML 2016. [http://arxiv.org/abs/1603.01417]
  2. Aiwen Jiang, Fang Wang, Fatih Porikli, Yi Li, Compositional Memory for Visual Question Answering, arXiv:1511.05676v1, Nov 2015. [http://arxiv.org/abs/1511.05676]

Video QA

  1. Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun, Leveraging Video Descriptions to Learn Video Question Answering, AAAI 2017. [https://arxiv.org/abs/1611.04021]
    1. Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler, MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016. [http://arxiv.org/abs/1512.02902]
    2. Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann, Uncovering Temporal Context for Video Question and Answering, arXiv:1511.05676v1, Nov 2015. [http://arxiv.org/abs/1511.04670]

综述

  1. Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding [2017].
  2. Tutorial on Answering Questions about Images with Deep Learning Mateusz Malinowski, Mario Fritz
  3. Survey of Visual Question Answering: Datasets and Techniques
  4. Visual Question Answering: Datasets, Algorithms, and Future Challenges

Tutorial

  1. CVPR 2017 VQA Challenge Workshop  (有很多PPT)
  2. CVPR 2016 VQA Challenge Workshop 
  3. Tutorial on Answering Questions about Images with Deep Learning
  4. Visual Question Answering Demo in Python Notebook
  5. Tutorial on Question Answering about Images

Dataset

  1. Visual7W: Grounded Question Answering in Images
  2. DAQUAR
  3. COCO-QA
  4. The VQA Dataset
  5. FM-IQA
  6. Visual Genome

Code

  1. VQA Demo: Visual Question Answering Demo on pretrained model
  2. deep-qa: Implementation of the Convolution Neural Network for factoid QA on the answer sentence selection task
  3. YodaQA: A Question Answering system built on top of the Apache UIMA framework
  4. insuranceQA-cnn-lstm: tensorflow and theano cnn code for insurance QA
  5. Tensorflow Implementation of Deeper LSTM+ normalized CNN for Visual Question Answering
  6. Visual Question Answering with Keras
  7. Deep Learning Models for Question Answering with Keras
  8. Deep QA: Using deep learning to answer Aristo's science questions
  9. Visual Question Answering in Pytorch

领域专家

  1. Qi Wu
  2. Bolei Zhou 周博磊
  3. Stanislaw Antol
  4. Jin-Hwa Kim
  5. Vahid Kazemi
  6. Justin Johnson
  7. Ilija Ilievski

初步版本,水平有限,有错误或者不完善的地方,欢迎大家提建议和补充,会一直保持更新,本文为专知内容组原创内容,未经允许不得转载,如需转载请发送邮件至[email protected] 或 联系微信专知小助手(Rancho_Fang)

敬请关注http://www.zhuanzhi.ai 和关注专知公众号,获取第一手AI相关知识

展开全文
Top