Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on only a subset of imaging modalities and focus primarily on high-resource languages, which limits their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLM learning, including data augmentation strategies and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step toward more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.