General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed literature, and evaluate its clinical utility against GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude 3.5 Sonnet, we converted these into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian by fine-tuning the 34-billion-parameter LLaVA-Next model. In a blinded, randomized trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgery consultations were randomly assigned to either CNS-Obsidian or a HIPAA-compliant GPT-4o endpoint, which served as a diagnostic co-pilot after each consultation. Primary outcomes were diagnostic helpfulness and accuracy, assessed via user ratings and the presence of the correct diagnosis within the VLM-provided differential. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but achieved only 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10⁻¹⁵). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included the correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized AI models.