We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. We present a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at http://cvit.iiit.ac.in/docvqa/