Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
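To make the two-stage pipeline described above concrete, here is a minimal, hedged sketch of the "caption, then read" idea using off-the-shelf Hugging Face components. It is not the paper's exact implementation: the question-guided patch sampling via network interpretation (GradCAM-style relevance) is omitted, the captions are simply concatenated instead of fused with a Fusion-in-Decoder reader, and the model names (Salesforce/blip-image-captioning-base, allenai/unifiedqa-t5-large), the image path example.jpg, and the sample question are illustrative assumptions.

```python
# Sketch of a PNP-VQA-style zero-shot pipeline: generate several image captions,
# then pass them as reading context to a frozen question-answering PLM.
# Simplifications vs. the paper are noted in the comments below.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,   # image captioner
    AutoTokenizer, T5ForConditionalGeneration,     # frozen reader PLM
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Captioning stage (the paper first samples question-relevant image patches
#    with a network-interpretation relevance map; here we caption the full image).
cap_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

image = Image.open("example.jpg").convert("RGB")   # placeholder image
question = "What color is the umbrella?"           # placeholder question

cap_inputs = cap_proc(images=image, return_tensors="pt").to(device)
caption_ids = cap_model.generate(
    **cap_inputs, do_sample=True, top_p=0.9,
    num_return_sequences=5, max_new_tokens=30)
captions = [cap_proc.decode(c, skip_special_tokens=True) for c in caption_ids]

# 2) Question-answering stage: the captions become textual context for a frozen
#    PLM (the paper fuses many captions with Fusion-in-Decoder; we concatenate).
qa_tok = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-large")
qa_model = T5ForConditionalGeneration.from_pretrained(
    "allenai/unifiedqa-t5-large").to(device)

context = " ".join(captions)
prompt = f"{question.lower()} \\n {context.lower()}"   # UnifiedQA-style input
qa_inputs = qa_tok(prompt, return_tensors="pt", truncation=True).to(device)
answer_ids = qa_model.generate(**qa_inputs, max_new_tokens=10)
print(qa_tok.decode(answer_ids[0], skip_special_tokens=True))
```

No component is fine-tuned here: both models are used exactly as released, which is the "plug-and-play" property the abstract emphasizes.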