Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude 3.7 Sonnet as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 Sonnet in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 Sonnet and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring the complexity of assessing domain-adapted LLMs. Nevertheless, domain-specific adaptation through finetuning and RAG achieves performance competitive with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.