Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, i.e., the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants achieve reasonable localization for large and well-defined pathologies, their performance degrades substantially for small or diffuse lesions; (2) models pretrained on chest X-ray-specific datasets exhibit better alignment than those trained on general-domain data; and (3) a model's overall recognition ability and its grounding ability are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short of clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. The XBench code is available at https://github.com/Roypic/Benchmarkingattention
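The sketch below illustrates the kind of similarity-based localization and alignment scoring the abstract describes: patch embeddings from a CLIP-style image encoder are compared against a pathology text embedding to form a heatmap, which is then thresholded and scored against a radiologist-annotated mask. The function names, grid handling, normalization, and IoU threshold here are illustrative assumptions, not the benchmark's exact implementation.

```python
# Minimal sketch (assumed interface, not XBench's exact code): similarity-based
# localization from CLIP-style patch/text embeddings, scored against an
# expert-annotated binary mask with IoU.
import torch
import torch.nn.functional as F


def similarity_map(patch_feats: torch.Tensor, text_feat: torch.Tensor,
                   grid_hw: tuple, image_hw: tuple) -> torch.Tensor:
    """Cosine similarity between each image-patch embedding and a text-prompt
    embedding, reshaped to the patch grid and upsampled to image resolution."""
    patch_feats = F.normalize(patch_feats, dim=-1)   # (num_patches, dim)
    text_feat = F.normalize(text_feat, dim=-1)       # (dim,)
    sims = patch_feats @ text_feat                   # (num_patches,)
    h, w = grid_hw
    sim_grid = sims.reshape(1, 1, h, w)
    return F.interpolate(sim_grid, size=image_hw, mode="bilinear",
                         align_corners=False)[0, 0]  # (H, W)


def iou_with_annotation(sim_map: torch.Tensor, gt_mask: torch.Tensor,
                        threshold: float = 0.5) -> float:
    """IoU between the thresholded (min-max normalized) similarity map and a
    radiologist-annotated binary mask."""
    sim = (sim_map - sim_map.min()) / (sim_map.max() - sim_map.min() + 1e-8)
    pred = sim > threshold
    gt = gt_mask.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union > 0 else 0.0


# Example with dummy tensors: a 14x14 ViT patch grid upsampled to 224x224.
patches = torch.randn(14 * 14, 512)
text = torch.randn(512)
heatmap = similarity_map(patches, text, grid_hw=(14, 14), image_hw=(224, 224))
mask = torch.zeros(224, 224); mask[60:120, 80:160] = 1   # stand-in annotation
print(iou_with_annotation(heatmap, mask))
```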