Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared with general-domain VQA, biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome this data limitation. Specifically, we collect a new biomedical dataset named PMCPM, which offers patient-based image-text pairs covering diverse patient situations from PubMed. We then pretrain a biomedical multi-modal model to learn visual and textual representations of image-text pairs and align these representations with an image-text contrastive (ITC) objective. Finally, we propose a retrieval-augmented method to make better use of the limited data: we retrieve image-text pairs similar to the input from the pretraining datasets based on ITC similarity, and introduce a novel retrieval-attention module that fuses the representations of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on the Med-VQA 2019, Med-VQA 2021, VQA-RAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, code, and pretrained model.
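To make the retrieval-augmented idea above concrete, the following is a minimal sketch (not the authors' released code): retrieve the top-k pretraining image-text pairs most similar to the query in the ITC embedding space, then fuse them with the query features via a cross-attention ("retrieval-attention") block. All names, dimensions, and the use of cosine similarity and `nn.MultiheadAttention` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def retrieve_topk(query_emb, corpus_embs, k=4):
    """Cosine-similarity retrieval, mirroring ITC-style matching (assumed)."""
    query_emb = F.normalize(query_emb, dim=-1)      # (d,)
    corpus_embs = F.normalize(corpus_embs, dim=-1)  # (N, d)
    sims = corpus_embs @ query_emb                  # (N,) similarity scores
    return sims.topk(k).indices                     # indices of k nearest pairs


class RetrievalAttention(nn.Module):
    """Hypothetical fusion block: query tokens attend over retrieved features."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, retrieved_tokens):
        # query_tokens:     (B, Lq, d) fused image+question representation
        # retrieved_tokens: (B, Lr, d) features of retrieved image-text pairs
        fused, _ = self.attn(query_tokens, retrieved_tokens, retrieved_tokens)
        return self.norm(query_tokens + fused)      # residual + layer norm


# Toy usage: a 1000-pair corpus; retrieve 4 neighbours and fuse them in.
corpus = torch.randn(1000, 768)
idx = retrieve_topk(torch.randn(768), corpus)
fuser = RetrievalAttention()
out = fuser(torch.randn(1, 32, 768), corpus[idx].unsqueeze(0))  # (1, 32, 768)
```

The design choice sketched here, reusing the ITC embedding space for retrieval so that no separate retriever needs training, follows the abstract's description; how the retrieved image and text features are tokenized and concatenated is an assumption of this sketch.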