Visual Question Answering - Zhuanzhi Collection

Visual Question Answering (VQA) is a learning task that involves both computer vision and natural language processing. The task is defined as follows: "A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output" [1]. Put simply, VQA means answering questions about a given image.
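The input/output contract in the definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `VQASystem` and `QuestionPriorBaseline` are hypothetical names that do not come from any paper or library listed here, and the baseline deliberately ignores the image, answering with the most frequent training answer for the question's first word. Real systems combine image and question features.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable, Protocol, Tuple


class VQASystem(Protocol):
    """Hypothetical interface: (image, question) in, natural-language answer out."""

    def answer(self, image: Any, question: str) -> str: ...


class QuestionPriorBaseline:
    """Toy baseline (an assumption for illustration, not a published method).

    It never looks at the image. It returns the most frequent training answer
    among questions that start with the same first word.
    """

    def __init__(self) -> None:
        self._by_prefix = defaultdict(Counter)  # first word -> answer counts
        self._overall = Counter()               # fallback answer counts

    @staticmethod
    def _prefix(question: str) -> str:
        words = question.lower().split()
        return words[0] if words else ""

    def fit(self, examples: Iterable[Tuple[Any, str, str]]) -> "QuestionPriorBaseline":
        # Each example is (image, question, answer); the image is unused here.
        for _image, question, answer in examples:
            self._by_prefix[self._prefix(question)][answer] += 1
            self._overall[answer] += 1
        return self

    def answer(self, image: Any, question: str) -> str:
        # Fall back to the overall answer distribution for unseen prefixes.
        counts = self._by_prefix.get(self._prefix(question)) or self._overall
        if not counts:
            raise ValueError("fit() must be called with at least one example")
        return counts.most_common(1)[0][0]
```

A baseline of this kind is mainly useful as a sanity check: a model that barely beats it is probably relying on the question wording rather than on the image.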

Getting Started

Deep Learning-Based VQA (Visual Question Answering) Techniques
- [https://zhuanlan.zhihu.com/p/22530291]
A Panoramic Overview of Visual Question Answering: From Datasets to Methods
- https://mp.weixin.qq.com/s/dyor9bv2y0VyX7woMDVLkA
Paper Reading Notes (Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding)
- [http://www.jianshu.com/p/5bf03d1fadfa]
How Far Are We from AI That Can Answer Questions About Images? Facebook Advances Toward Visual Dialog
- [https://www.leiphone.com/news/201711/4B9cNlCINsVyPdTw.html]
Image Question Answering
- [http://www.cnblogs.com/ranjiewen/p/7604468.html]
Deep Learning in Practice: Image Question Answering
- [https://zhuanlan.zhihu.com/p/20899091]
2017 VQA Challenge First-Place Technical Report
- [https://zhuanlan.zhihu.com/p/29688475]
Deep Learning Builds a Bridge Between Vision and Language
- [http://www.msra.cn/zh-cn/news/features/vision-and-language-20170713]

Surveys

Visual Question Answering: A Tutorial
- https://ieeexplore.ieee.org/document/8103161
Information fusion in visual question answering: A Survey
- https://www.sciencedirect.com/science/article/pii/S1566253518308893
Visual Question Answering: Datasets, Methods, Challenges and Opportunities [2018]
- https://www.cs.princeton.edu/courses/archive/spring18/cos598B/public/projects/LiteratureReview/COS598B_spr2018_VQAreview.pdf
Visual Question Answering using Deep Learning: A Survey and Performance Analysis [2019]
- https://arxiv.org/pdf/1909.01860.pdf
Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding [2017].
- [https://arxiv.org/abs/1607.05910]
Mateusz Malinowski, Mario Fritz. Tutorial on Answering Questions about Images with Deep Learning.
- [https://arxiv.org/abs/1610.01076]
Survey of Visual Question Answering: Datasets and Techniques
- [https://arxiv.org/abs/1705.03865]
Visual Question Answering: Datasets, Algorithms, and Future Challenges
- [https://arxiv.org/abs/1610.01465]

Advanced Papers

Ranjay Krishna, Michael Bernstein, Li Fei-Fei: Information Maximizing Visual Question Generation. CVPR 2019 [https://arxiv.org/abs/1903.11207]
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, Louis-Philippe Morency: Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence. CVPR 2019[http://openaccess.thecvf.com/content_CVPR_2019/html/Zadeh_Social-IQ_A_Question_Answering_Benchmark_for_Artificial_Social_Intelligence_CVPR_2019_paper.html]
Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu: Learning to Compose Dynamic Tree Structures for Visual Contexts. CVPR 2019[https://arxiv.org/abs/1812.01880]
Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, Bohyung Han: Transfer Learning via Unsupervised Task Discovery for Visual Question Answering. CVPR 2019 [https://arxiv.org/abs/1810.02358]
Yao-Hung Hubert Tsai, Santosh Kumar Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, Ali Farhadi: Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph. CVPR 2019[https://arxiv.org/abs/1903.10547] [https://github.com/yaohungt/Gated-Spatio-Temporal-Energy-Graph]
Jiaxin Shi, Hanwang Zhang, Juanzi Li: Explainable and Explicit Visual Reasoning Over Scene Graphs. CVPR 2019[https://arxiv.org/abs/1812.01855] [https://github.com/shijx12/XNM-Net]
Rémi Cadène, Hedi Ben-younes, Matthieu Cord, Nicolas Thome: MUREL: Multimodal Relational Reasoning for Visual Question Answering. CVPR 2019: [https://arxiv.org/abs/1902.09487] [https://github.com/Cadene/murel.bootstrap.pytorch]
Dalu Guo, Chang Xu, Dacheng Tao: Image-Question-Answer Synergistic Network for Visual Dialog. CVPR 2019: [https://arxiv.org/abs/1902.09774]
Chenfei Wu, Jinlai Liu, Xiaojie Wang, Ruifan Li: Differential Networks for Visual Question Answering. AAAI 2019: [https://www.aaai.org/Papers/AAAI/2019/AAAI-WuC.76.pdf]
Hedi Ben-younes, Rémi Cadène, Nicolas Thome, Matthieu Cord: BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. AAAI 2019:
Chenfei Wu, Jinlai Liu, Xiaojie Wang, Xuan Dong: Chain of Reasoning for Visual Question Answering. NeurIPS 2018: [https://papers.nips.cc/paper/7311-chain-of-reasoning-for-visual-question-answering]
Will Norcliffe-Brown, Stathis Vafeias, Sarah Parisot: Learning Conditioned Graph Structures for Interpretable Visual Question Answering. NeurIPS 2018: [https://papers.nips.cc/paper/8054-learning-conditioned-graph-structures-for-interpretable-visual-question-answering] [https://github.com/aimbrain/vqa-project]
Medhini Narasimhan, Svetlana Lazebnik, Alexander G. Schwing: Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. NeurIPS 2018: [https://papers.nips.cc/paper/7531-out-of-the-box-reasoning-with-graph-convolution-nets-for-factual-visual-question-answering]
Sainandan Ramakrishnan, Aishwarya Agrawal, Stefan Lee: Overcoming Language Priors in Visual Question Answering with Adversarial Regularization. NeurIPS 2018: [https://papers.nips.cc/paper/7427-overcoming-language-priors-in-visual-question-answering-with-adversarial-regularization]
Somak Aditya, Yezhou Yang, Chitta Baral: Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering. AAAI 2018: [https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16446]
Zhihao Fan, Zhongyu Wei, Piji Li, Yanyan Lan, Xuanjing Huang: A Question Type Driven Framework to Diversify Visual Question Generation. IJCAI 2018: [https://www.ijcai.org/proceedings/2018/563]
Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel: Tips and Tricks for Visual Question Answering: Learnings From the 2017 Challenge. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Teney_Tips_and_Tricks_CVPR_2018_paper.html]
Ishan Misra, Ross B. Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten: Learning by Asking Questions. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Misra_Learning_by_Asking_CVPR_2018_paper.html]
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra: Embodied Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Das_Embodied_Question_Answering_CVPR_2018_paper.html]
Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, Jeffrey P. Bigham: VizWiz Grand Challenge: Answering Visual Questions From Blind People. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Gurari_VizWiz_Grand_Challenge_CVPR_2018_paper.html]
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi: Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Agrawal_Dont_Just_Assume_CVPR_2018_paper.html]
Hexiang Hu, Wei-Lun Chao, Fei Sha: Learning Answer Embeddings for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Learning_Answer_Embeddings_CVPR_2018_paper.html]
Wei-Lun Chao, Hexiang Hu, Fei Sha: Cross-Dataset Adaptation for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Chao_Cross-Dataset_Adaptation_for_CVPR_2018_paper.html]
Unnat Jain, Svetlana Lazebnik, Alexander G. Schwing: Two Can Play This Game: Visual Dialog With Discriminative Question Generation and Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Jain_Two_Can_Play_CVPR_2018_paper.html]
Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, Liang Lin: Visual Question Reasoning on General Dependency Tree. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Cao_Visual_Question_Reasoning_CVPR_2018_paper.html]
Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang, Changyin Sun: IVQA: Inverse Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Liu_IVQA_Inverse_Visual_CVPR_2018_paper.html]
Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada: Customized Image Narrative Generation via Interactive Visual Question Generation and Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Shin_Customized_Image_Narrative_CVPR_2018_paper.html]
Chenfei Wu, Jinlai Liu, Xiaojie Wang, Xuan Dong: Object-Difference Attention: A Simple Relational Attention for Visual Question Answering. ACM Multimedia 2018: [https://dl.acm.org/citation.cfm?doid=3240508.3240513]
Zhiwei Fang, Jing Liu, Yanyuan Qiao, Qu Tang, Yong Li, Hanqing Lu: Enhancing Visual Question Answering Using Dropout. ACM Multimedia 2018: [https://doi.org/10.1145/3240508.3240662]
Xuanyi Dong, Linchao Zhu, De Zhang, Yi Yang, Fei Wu: Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering. ACM Multimedia 2018: [https://doi.org/10.1145/3240508.3240527]
Xiaomeng Song, Yucheng Shi, Xin Chen, Yahong Han: Explore Multi-Step Reasoning in Video Question Answering. ACM Multimedia 2018: [https://doi.org/10.1145/3240508.3240563]
Damien Teney, Anton van den Hengel: Visual Question Answering as a Meta Learning Task. ECCV (15) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Damien_Teney_Visual_Question_Answering_ECCV_2018_paper.html]
Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven C. H. Hoi, Xiaogang Wang: Question-Guided Hybrid Convolution for Visual Question Answering. ECCV (1) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/gao_peng_Question-Guided_Hybrid_Convolution_ECCV_2018_paper.html]
Youngjae Yu, Jongseok Kim, Gunhee Kim: A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV (7) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Youngjae_Yu_A_Joint_Sequence_ECCV_2018_paper.html]
Medhini Narasimhan, Alexander G. Schwing: Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering. ECCV (8) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Medhini_Gulganjalli_Narasimhan_Straight_to_the_ECCV_2018_paper.html]
Kohei Uehara, Antonio Tejero-de-Pablos, Yoshitaka Ushiku, Tatsuya Harada: Visual Question Generation for Class Acquisition of Unknown Objects. ECCV (12) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Kohei_Uehara_Visual_Question_Generation_ECCV_2018_paper.html]
Peng Wang, Qi Wu, Chunhua Shen, Anthony R. Dick, Anton van den Hengel: FVQA: Fact-Based Visual Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 40(10): [https://arxiv.org/abs/1606.05433]
Alexander Trott, Caiming Xiong, Richard Socher: Interpretable Counting for Visual Question Answering. ICLR (Poster) 2018 [https://arxiv.org/abs/1712.08697]
Yan Zhang, Jonathon S. Hare, Adam Prügel-Bennett: Learning to Count Objects in Natural Images for Visual Question Answering. ICLR (Poster) 2018 [https://openreview.net/forum?id=B12Js_yRb]
Kushal Kafle and Christopher Kanan. Visual Question Answering: Datasets, Algorithms, and Future Challenges. Computer Vision and Image Understanding [2017].
- [https://arxiv.org/abs/1610.01465]
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick, CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017.
- [http://vision.stanford.edu/pdf/johnson2017cvpr.pdf]
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick, Inferring and Executing Programs for Visual Reasoning, arXiv:1705.03633, 2017. [https://arxiv.org/abs/1705.03633]
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko, Learning to Reason: End-to-End Module Networks for Visual Question Answering, arXiv:1704.05526, 2017. [https://arxiv.org/abs/1704.05526]
Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap, A simple neural network module for relational reasoning, arXiv:1706.01427, 2017. [https://arxiv.org/abs/1706.01427]
Hedi Ben-younes, Remi Cadene, Matthieu Cord, Nicolas Thome: MUTAN: Multimodal Tucker Fusion for Visual Question Answering [https://arxiv.org/pdf/1705.06676.pdf] [https://github.com/Cadene/vqa.pytorch]
Vahid Kazemi, Ali Elqursh, Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering, arXiv:1704.03162, 2017. [https://arxiv.org/abs/1704.03162] [https://github.com/Cyanogenoid/pytorch-vqa]
Kushal Kafle, and Christopher Kanan. An Analysis of Visual Question Answering Algorithms. arXiv:1703.09684, 2017. [https://arxiv.org/abs/1703.09684]
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim, Dual Attention Networks for Multimodal Reasoning and Matching, arXiv:1611.00471, 2016. [https://arxiv.org/abs/1611.00471]
Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325, 2016. [https://arxiv.org/abs/1610.04325]
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach, Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, arXiv:1606.01847, 2016. [https://arxiv.org/abs/1606.01847] [https://github.com/akirafukui/vqa-mcb]
Kuniaki Saito, Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada, DualNet: Domain-Invariant Network for Visual Question Answering. arXiv:1606.06108v1, 2016. [https://arxiv.org/pdf/1606.06108.pdf]
Arijit Ray, Gordon Christie, Mohit Bansal, Dhruv Batra, Devi Parikh, Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions, arXiv:1606.06622, 2016. [https://arxiv.org/pdf/1606.06622v1.pdf]
Hyeonwoo Noh, Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647, 2016. [http://arxiv.org/abs/1606.03647v1]
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, arXiv:1606.00061, 2016. [https://arxiv.org/pdf/1606.00061v2.pdf] [https://github.com/jiasenlu/HieCoAttenVQA]
Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Multimodal Residual Learning for Visual QA, arXiv:1606.01455, 2016. [https://arxiv.org/pdf/1606.01455v1.pdf]
Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, FVQA: Fact-based Visual Question Answering, arXiv:1606.05433, 2016. [https://arxiv.org/pdf/1606.05433.pdf]
Ilija Ilievski, Shuicheng Yan, Jiashi Feng, A Focused Dynamic Attention Model for Visual Question Answering, arXiv:1604.01485. [https://arxiv.org/pdf/1604.01485v1.pdf]
Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei, Visual7W: Grounded Question Answering in Images, CVPR 2016. [http://arxiv.org/abs/1511.03416]
Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han, Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, CVPR, 2016.[http://arxiv.org/pdf/1511.05756.pdf]
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Learning to Compose Neural Networks for Question Answering, NAACL 2016. [http://arxiv.org/pdf/1601.01705.pdf]
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Deep compositional question answering with neural module networks, CVPR 2016. [https://arxiv.org/abs/1511.02799]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, Stacked Attention Networks for Image Question Answering, CVPR 2016. [http://arxiv.org/abs/1511.02274] [https://github.com/JamesChuanggg/san-torch]
Kevin J. Shih, Saurabh Singh, Derek Hoiem, Where To Look: Focus Regions for Visual Question Answering, CVPR 2016. [http://arxiv.org/pdf/1511.07394v2.pdf]
Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering, arXiv:1511.05960v1, Nov 2015. [http://arxiv.org/pdf/1511.05960v1.pdf]
Huijuan Xu, Kate Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, arXiv:1511.05234v1, Nov 2015. [http://arxiv.org/abs/1511.05234]
Kushal Kafle and Christopher Kanan, Answer-Type Prediction for Visual Question Answering, CVPR 2016. [http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Kafle_Answer-Type_Prediction_for_CVPR_2016_paper.html]
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, ICCV, 2015. [http://arxiv.org/pdf/1505.00468] [https://github.com/JamesChuanggg/VQA-tensorflow]
Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus, Simple Baseline for Visual Question Answering, arXiv:1512.02167v2, Dec 2015. [http://arxiv.org/abs/1512.02167]
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, NIPS 2015. [http://arxiv.org/pdf/1505.05612.pdf]
Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, ICCV 2015. [http://arxiv.org/pdf/1505.01121v3.pdf]
Mengye Ren, Ryan Kiros, Richard Zemel, Exploring Models and Data for Image Question Answering, ICML 2015. [http://arxiv.org/pdf/1505.02074.pdf]
Mateusz Malinowski, Mario Fritz, Towards a Visual Turing Challenge, NIPS Workshop 2014. [http://arxiv.org/abs/1410.8027]
Mateusz Malinowski, Mario Fritz, A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, NIPS 2014. [http://arxiv.org/pdf/1410.0210v4.pdf]

Attention-Based

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, Qi Tian: Deep Modular Co-Attention Networks for Visual Question Answering. CVPR 2019: [http://openaccess.thecvf.com/content_CVPR_2019/papers/Yu_Deep_Modular_Co-Attention_Networks_for_Visual_Question_Answering_CVPR_2019_paper.pdf] [https://github.com/MILVLG/mcan-vqa]
Yiyi Zhou, Rongrong Ji, Jinsong Su, Xiaoshuai Sun, Weiqiu Chen: Dynamic Capsule Attention for Visual Question Answering. AAAI 2019: [https://www.aaai.org/Papers/AAAI/2019/AAAI-ZhouYiyi2.3610.pdf] [https://github.com/XMUVQA/CapsAtt]
Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, Heng Tao Shen: Structured Two-Stream Attention Network for Video Question Answering. AAAI 2019:[https://www.aaai.org/ojs/index.php/AAAI/article/view/4602]
Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, Chuang Gan: Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering. AAAI 2019: [https://www.semanticscholar.org/paper/Beyond-RNNs%3A-Positional-Self-Attention-with-for-Li-Song/565359aac8914505e6b02db05822ee63d3ffd03a] [https://github.com/lixiangpengcs/PSAC]
Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, Xiaogang Wang: Co-Attending Free-Form Regions and Detections With Multi-Modal Multiplicative Feature Embedding for Visual Question Answering. AAAI 2018: [https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16249] [https://github.com/lupantech/dual-mfa-vqa]
Tingting Qiao, Jianfeng Dong, Duanqing Xu: Exploring Human-Like Attention Supervision in Visual Question Answering. AAAI 2018: [https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16485]
Yuetan Lin, Zhangyang Pang, Donghui Wang, Yueting Zhuang: Feature Enhancement in Attention for Visual Question Answering. IJCAI 2018: [https://www.ijcai.org/proceedings/2018/586]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang: Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html] [https://github.com/peteanderson80/bottom-up-attention] [https://github.com/facebookresearch/pythia] [https://github.com/hengyuan-hu/bottom-up-attention-vqa]
Duy-Kien Nguyen, Takayuki Okatani: Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Nguyen_Improved_Fusion_of_CVPR_2018_paper.html]
Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, Ming Zhou: Visual Question Generation as Dual Task of Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Li_Visual_Question_Generation_CVPR_2018_paper.html]
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann: Focal Visual-Text Attention for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Liang_Focal_Visual-Text_Attention_CVPR_2018_paper.html]
Badri N. Patro, Vinay P. Namboodiri: Differential Attention for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Patro_Differential_Attention_for_CVPR_2018_paper.html]
Yang Shi, Tommaso Furlanello, Sheng Zha, Animashree Anandkumar: Question Type Guided Attention in Visual Question Answering. ECCV (4) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Yang_Shi_Question_Type_Guided_ECCV_2018_paper.html]
Mateusz Malinowski, Carl Doersch, Adam Santoro, Peter W. Battaglia: Learning Visual Question Answering by Bootstrapping Hard Attention. ECCV (6) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Mateusz_Malinowski_Learning_Visual_Question_ECCV_2018_paper.html]
Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, Jianyong Wang: R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering. KDD 2018: [https://dl.acm.org/citation.cfm?doid=3219819.3220036]
Hedi Ben-younes, Remi Cadene, Matthieu Cord, Nicolas Thome: MUTAN: Multimodal Tucker Fusion for Visual Question Answering [https://arxiv.org/pdf/1705.06676.pdf] [https://github.com/Cadene/vqa.pytorch]
Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325, 2016. [https://arxiv.org/abs/1610.04325]
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach, Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, arXiv:1606.01847, 2016. [https://arxiv.org/abs/1606.01847]
Hyeonwoo Noh, Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647, 2016. [http://arxiv.org/abs/1606.03647v1]
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, arXiv:1606.00061, 2016. [https://arxiv.org/pdf/1606.00061v2.pdf]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, Stacked Attention Networks for Image Question Answering, CVPR 2016. [http://arxiv.org/abs/1511.02274]
Ilija Ilievski, Shuicheng Yan, Jiashi Feng, A Focused Dynamic Attention Model for Visual Question Answering, arXiv:1604.01485. [https://arxiv.org/pdf/1604.01485v1.pdf]
Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering, arXiv:1511.05960v1, Nov 2015. [http://arxiv.org/pdf/1511.05960v1.pdf]
Huijuan Xu, Kate Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, arXiv:1511.05234v1, Nov 2015. [http://arxiv.org/abs/1511.05234]

Knowledge-based

Yiyi Zhou, Rongrong Ji, Jinsong Su, Xiangming Li, Xiaoshuai Sun: Free VQA Models from Knowledge Inertia by Pairwise Inconformity Learning. AAAI 2019: [https://www.aaai.org/Papers/AAAI/2019/AAAI-ZhouYiyi1.1233.pdf] [https://github.com/xiangmingLi/PIL]
Jonghwan Mun, Kimin Lee, Jinwoo Shin, Bohyung Han: Learning to Specialize with Knowledge Distillation for Visual Question Answering. NeurIPS 2018 [https://papers.nips.cc/paper/8031-learning-to-specialize-with-knowledge-distillation-for-visual-question-answering]
Qi Wu, Chunhua Shen, Peng Wang, Anthony R. Dick, Anton van den Hengel: Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 40(6): [https://arxiv.org/abs/1603.02814]
Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, FVQA: Fact-based Visual Question Answering, arXiv:1606.05433, 2016. [https://arxiv.org/pdf/1606.05433.pdf]
Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, Anthony Dick, Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources, CVPR 2016. [http://arxiv.org/abs/1511.06973]
Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, Explicit Knowledge-based Reasoning for Visual Question Answering, arXiv:1511.02570v2, Nov 2015. [http://arxiv.org/abs/1511.02570]
Yuke Zhu, Ce Zhang, Christopher Ré, Li Fei-Fei, Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv:1507.05670, Nov 2015. [http://arxiv.org/abs/1507.05670]

Memory Network

Juzheng Li, Hang Su, Jun Zhu, Siyu Wang, Bo Zhang: Textbook Question Answering Under Instructor Guidance With Memory Networks. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Li_Textbook_Question_Answering_CVPR_2018_paper.html] [https://github.com/freerailway/igmn]
Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi: IQA: Visual Question Answering in Interactive Environments. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Gordon_IQA_Visual_Question_CVPR_2018_paper.html] [https://youtu.be/pXd3C-1jr98]
Jiyang Gao, Runzhou Ge, Kan Chen, Ram Nevatia: Motion-Appearance Co-Memory Networks for Video Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Gao_Motion-Appearance_Co-Memory_Networks_CVPR_2018_paper.html]
Chao Ma, Chunhua Shen, Anthony R. Dick, Qi Wu, Peng Wang, Anton van den Hengel, Ian D. Reid: Visual Question Answering With Memory-Augmented Networks. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Ma_Visual_Question_Answering_CVPR_2018_paper.html]
Zhou Su, Chen Zhu, Yinpeng Dong, Dongqi Cai, Yurong Chen, Jianguo Li: Learning Visual Knowledge Memory Networks for Visual Question Answering. CVPR 2018: [http://openaccess.thecvf.com/content_cvpr_2018/html/Su_Learning_Visual_Knowledge_CVPR_2018_paper.html]
Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, Byoung-Tak Zhang: Multimodal Dual Attention Memory for Video Story Question Answering. ECCV (15) 2018: [http://openaccess.thecvf.com/content_ECCV_2018/html/Kyungmin_Kim_Multimodal_Dual_Attention_ECCV_2018_paper.html]
Caiming Xiong, Stephen Merity, Richard Socher, Dynamic Memory Networks for Visual and Textual Question Answering, ICML 2016. [http://arxiv.org/abs/1603.01417]
Aiwen Jiang, Fang Wang, Fatih Porikli, Yi Li, Compositional Memory for Visual Question Answering, arXiv:1511.05676v1, Nov 2015. [http://arxiv.org/abs/1511.05676]

Video QA

Bo Wang, Youjiang Xu, Yahong Han, Richang Hong: Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents. AAAI 2018: [https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16359]
Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He, Shiliang Pu: Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network. IJCAI 2018: [https://www.ijcai.org/proceedings/2018/513]
Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, Yueting Zhuang: Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks. IJCAI 2018: [https://www.ijcai.org/proceedings/2018/512]
Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun, Leveraging Video Descriptions to Learn Video Question Answering, AAAI 2017. [https://arxiv.org/abs/1611.04021]
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler, MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016. [http://arxiv.org/abs/1512.02902]
Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann, Uncovering Temporal Context for Video Question and Answering, arXiv:1511.04670, Nov 2015. [http://arxiv.org/abs/1511.04670]

Embodied Question Answering

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra: Embodied Question Answering. CVPR 2018: [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8575449&tag=1]
Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra: Multi-Target Embodied Question Answering. CVPR 2019: [https://arxiv.org/pdf/1904.04686.pdf]
Juncheng Li, Siliang Tang, Fei Wu, Yueting Zhuang: Walking with MIND: Mental Imagery eNhanceD Embodied QA. ACM Multimedia 2019: [https://dl.acm.org/citation.cfm?doid=3343031.3351017]
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra: Embodied Question Answering in Photorealistic Environments With Point Cloud Perception. CVPR 2019: [https://arxiv.org/pdf/1904.03461.pdf]

Tutorial

CVPR 2019 VQA Challenge
- https://visualqa.org/challenge.html
CVPR 2018 VQA Challenge and Visual Dialog Workshop
- https://visualqa.org/workshop_2018.html
CVPR 2017 VQA Challenge Workshop (many slide decks available)
- [http://www.visualqa.org/workshop.html]
CVPR 2016 VQA Challenge Workshop
- [http://www.visualqa.org/vqa_v1_workshop.html]
Tutorial on Answering Questions about Images with Deep Learning
- [https://arxiv.org/pdf/1610.01076.pdf]
Visual Question Answering Demo in Python Notebook
- [http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook]
Tutorial on Question Answering about Images
- [https://www.linkedin.com/pulse/tutorial-question-answering-images-mateusz-malinowski/]

Datasets

Visual7W: Grounded Question Answering in Images
- homepage: http://web.stanford.edu/~yukez/visual7w/
- github: https://github.com/yukezhu/visual7w-toolkit
- github: https://github.com/yukezhu/visual7w-qa-models
DAQUAR
- [http://www.cs.toronto.edu/~mren/imageqa/results/]
COCO-QA
- [http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa/]
The VQA Dataset
- [http://visualqa.org/]
FM-IQA
- [http://idl.baidu.com/FM-IQA.html]
Visual Genome
- [http://visualgenome.org/]
RAVEN: A Dataset for Relational and Analogical Visual rEasoNing
- https://arxiv.org/abs/1903.02741
- http://wellyzhang.github.io/project/raven.html
KVQA: Knowledge-Aware Visual Question Answering.
- https://github.com/sanket0211/WK-VQA#

Code

VQA Demo: Visual Question Answering Demo on pretrained model
- [https://github.com/iamaaditya/VQA_Demo]
- [http://iamaaditya.github.io/research/]
deep-qa: Implementation of the Convolution Neural Network for factoid QA on the answer sentence selection task
- [https://github.com/aseveryn/deep-qa]
YodaQA: A Question Answering system built on top of the Apache UIMA framework
- [http://ailao.eu/yodaqa/]
- [https://github.com/brmson/yodaqa]
insuranceQA-cnn-lstm: tensorflow and theano cnn code for insurance QA
- [https://github.com/white127/insuranceQA-cnn-lstm]
Tensorflow Implementation of Deeper LSTM+ normalized CNN for Visual Question Answering
- [https://github.com/JamesChuanggg/VQA-tensorflow]
Visual Question Answering with Keras
- [https://anantzoid.github.io/VQA-Keras-Visual-Question-Answering/]
Deep Learning Models for Question Answering with Keras
- [http://sujitpal.blogspot.jp/2016/10/deep-learning-models-for-question.html]
Deep QA: Using deep learning to answer Aristo's science questions
- [https://github.com/allenai/deep_qa]
Visual Question Answering in Pytorch
- [https://github.com/Cadene/vqa.pytorch]

Domain Experts

Qi Wu

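Dr. Qi Wu is currently a Senior Lecturer at the University of Adelaide and an ARC Senior Research Associate at the Australian Centre for Robotic Vision (ACRV), University of Adelaide. Before that, he was a postdoctoral researcher at the Australian Centre for Visual Technologies (ACVT). He received an MSc in Global Computing and Media Technology in 2011 and a PhD in Computer Science in 2015, both from the University of Bath, UK. His research interests include cross-depictive style object modelling, object detection, and vision-to-language problems, with a particular focus on image captioning and visual question answering. According to this profile, his image captioning model achieved the best result in the previous year's Microsoft COCO Image Captioning Challenge, and his VQA model is the current state of the art. His papers have been published in leading journals and conferences such as TPAMI, CVPR, ICCV, and ECCV.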

[https://researchers.adelaide.edu.au/profile/qi.wu01]
Bolei Zhou 周博磊

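Assistant Professor in the Department of Information Engineering at CUHK. His research covers machine perception and decision making, computer vision, and related areas. He graduated from MIT in 2018, where he was advised by Antonio Torralba. His awards include the Facebook Ph.D. Fellowship in Computer Vision, the BRC Fellowship Award, and the MIT Greater China Computer Science Fellowship. He has published nearly 50 papers in journals and conferences such as NeurIPS, TPAMI, IJCV, ICCV, CVPR, and ECCV.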

[http://bzhou.ie.cuhk.edu.hk/]
Stanislaw Antol

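Currently works on the autonomous vehicle team at Mercedes-Benz Research and Development. From March 2018 to March 2019 he was a computer vision engineer at Traptic, working on strawberry-picking robots. From July 2016 to February 2018 he was a research engineer in the computer vision lab of the Think Tank Team at Samsung Research America.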

[https://computing.ece.vt.edu/~santol/]
Jin-Hwa Kim

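Research scientist at SK T-Brain since August 2018. His research area is multimodal deep learning, with a main focus on visual question answering and related topics. In September 2017 he received a 2017 Google Ph.D. Fellowship in machine learning. From January to May 2017 he interned at Facebook AI Research under the supervision of Yuandong Tian. He received his Ph.D. from Seoul National University in 2018, advised by Professor Byoung-Tak Zhang. He received a master's degree in engineering from Seoul National University in 2015 and a bachelor's degree in engineering (with honors) from Kwangwoon University in 2011. From 2011 to 2012 he was a software engineer on the search infrastructure development team at SK Communications (South Korea).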

[http://wityworks.com/]
Justin Johnson

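Currently an Assistant Professor of Computer Science and Engineering at the University of Michigan. His research covers visual reasoning, language and vision, and image generation with deep networks. He received his Ph.D. from Stanford University, where he was advised by Li Fei-Fei, and was one of the instructors of CS231N.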
- [http://cs.stanford.edu/people/jcjohns/]
Ilija Ilievski

Ph.D. from the National University of Singapore, with research focused on visual question answering. Placed third in the CVPR 2017 VQA Challenge.

[https://ilija139.github.io/]