Foundation models have been successful in natural language processing and computer vision because they capture the underlying structure, the foundation, of natural language. In medical imaging, however, the key foundation lies in human anatomy: these images directly depict the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to learn anatomical features effectively. To overcome this limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision), pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as supervision signals. Extensive experiments on 10 datasets, evaluated through fine-tuning and emergent-property analysis, demonstrate Lamps' superior robustness, transferability, and clinical potential compared with 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations aligned with the structure of human anatomy.
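The abstract names three anatomical supervision signals (consistency, coherence, hierarchy) but does not specify the objective that combines them. The sketch below is purely illustrative: it shows one plausible way such signals could be folded into a single self-supervised pretraining loss. The encoder, the patch-sampling scheme, the particular loss forms (cosine consistency, a triplet-style coherence margin, and a parent/child hierarchy term), and the weights are all hypothetical assumptions, not the Lamps method.

```python
# Hypothetical sketch: combining consistency, coherence, and hierarchy signals
# into one SSL loss. NOT the Lamps implementation -- all design choices here
# are assumptions made for illustration.
import torch
import torch.nn.functional as F

def anatomy_ssl_loss(encoder, view1, view2, neighbor, parent, child,
                     weights=(1.0, 1.0, 1.0), margin=0.1):
    # L2-normalize embeddings so dot products are cosine similarities.
    z1 = F.normalize(encoder(view1), dim=-1)
    z2 = F.normalize(encoder(view2), dim=-1)
    zn = F.normalize(encoder(neighbor), dim=-1)
    zp = F.normalize(encoder(parent), dim=-1)
    zc = F.normalize(encoder(child), dim=-1)

    # Consistency: the same anatomical structure, seen under two
    # augmentations, should map to the same embedding.
    l_cons = (1 - (z1 * z2).sum(-1)).mean()

    # Coherence: a patch should embed closer to its true spatial neighbor
    # than to a randomly shuffled one (triplet-style margin loss).
    d_pos = 1 - (z1 * zn).sum(-1)
    d_neg = 1 - (z1 * zn[torch.randperm(zn.size(0))]).sum(-1)
    l_coh = F.relu(d_pos - d_neg + margin).mean()

    # Hierarchy: a sub-part's embedding should be predictable from the
    # larger anatomical region that contains it.
    l_hier = (1 - (zp * zc).sum(-1)).mean()

    w1, w2, w3 = weights
    return w1 * l_cons + w2 * l_coh + w3 * l_hier

# Toy usage with a random linear encoder on 32x32 single-channel "patches".
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 128))
patch = lambda: torch.randn(8, 1, 32, 32)
loss = anatomy_ssl_loss(encoder, patch(), patch(), patch(), patch(), patch())
loss.backward()
```

Weighting the three terms in one objective is what "harmoniously utilizing" multiple anatomical perspectives might look like in practice; the actual Lamps formulation would be given in the paper's methods section.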