Existing vision-text contrastive learning methods such as CLIP aim to match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general images and captions available on the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from different patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thereby scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We demonstrate that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP outperforms the state-of-the-art method (which uses around 200K data). Our code is available at https://github.com/RyanWangZf/MedCLIP.
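To make the idea concrete, the following is a minimal PyTorch sketch of a knowledge-driven semantic matching loss of the kind described above, assuming each image and report comes with a multi-hot vector of extracted medical-entity labels. The function name, label format, and soft-target normalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels, temperature=0.07):
    """Illustrative knowledge-driven semantic matching loss (a sketch, not the
    paper's exact code): InfoNCE's hard one-hot targets are replaced by soft
    targets derived from overlapping medical-entity labels, so images and
    reports from different patients that share semantics are not pushed apart.
    """
    # Cosine-similarity logits between every image and every text in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                        # (B_img, B_txt)

    # Soft targets from multi-hot entity labels (e.g., "edema", "cardiomegaly").
    sim = F.normalize(img_labels.float(), dim=-1) @ F.normalize(txt_labels.float(), dim=-1).t()
    t_i2t = sim / sim.sum(dim=1, keepdim=True).clamp(min=1e-8)          # image -> text
    t_t2i = sim.t() / sim.t().sum(dim=1, keepdim=True).clamp(min=1e-8)  # text -> image

    # Cross-entropy between the predicted matching distribution and the soft targets,
    # applied symmetrically in both retrieval directions.
    loss_i2t = -(t_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(t_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Note that if `sim` were the identity matrix (each image matched only to its own report), the soft targets collapse to one-hot vectors and the objective reduces to the standard symmetric InfoNCE loss.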