Retrieval-Augmented Generation (RAG) systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, because captions from general-purpose Vision-Language Models (VLMs) often lack clinical specificity and factual grounding. This study proposes and validates a framework that specializes the MedGemma model to generate high-fidelity captions that serve as more effective retrieval queries. To overcome data scarcity, we use a knowledge distillation pipeline to create a synthetic dataset spanning the dermatology, fundus, and chest radiography domains, and we fine-tune MedGemma with the parameter-efficient QLoRA method. Performance was assessed with a dual framework that measures classification accuracy and, through a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model showed substantial improvements in classification performance, and the RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
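The dual evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that each generated caption has already been reduced to a predicted label and to a list of per-claim verdicts indicating whether the reference context supports each claim. In RAGAS those verdicts come from an LLM judge; here they are supplied by hand. The `CaptionResult` type, the function names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionResult:
    """One evaluated caption (hypothetical structure for illustration)."""
    predicted_label: str         # diagnosis label parsed from the caption
    true_label: str              # ground-truth label for the image
    claim_supported: List[bool]  # one verdict per claim in the caption


def classification_accuracy(results: List[CaptionResult]) -> float:
    """Fraction of captions whose predicted label matches the ground truth."""
    if not results:
        raise ValueError("results must not be empty")
    correct = sum(r.predicted_label == r.true_label for r in results)
    return correct / len(results)


def faithfulness(result: CaptionResult) -> float:
    """Supported claims divided by total claims, as in the RAGAS definition.

    A caption with no extracted claims is scored 0.0 here; that choice is
    an assumption of this sketch.
    """
    if not result.claim_supported:
        return 0.0
    return sum(result.claim_supported) / len(result.claim_supported)


def mean_faithfulness(results: List[CaptionResult]) -> float:
    """Average per-caption faithfulness over the evaluation set."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(faithfulness(r) for r in results) / len(results)


# Toy records, invented purely to exercise the functions.
sample_results = [
    CaptionResult("melanoma", "melanoma", [True, True, False]),
    CaptionResult("pneumonia", "normal", [True, False]),
    CaptionResult("diabetic retinopathy", "diabetic retinopathy",
                  [True, True, True, True]),
]

if __name__ == "__main__":
    print(f"accuracy:     {classification_accuracy(sample_results):.3f}")
    print(f"faithfulness: {mean_faithfulness(sample_results):.3f}")
```

Keeping the two scores separate reflects the abstract's point that a caption can name the correct condition and still contain unsupported statements, so neither metric alone characterizes query quality.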