Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and a lack of EHR-oriented reasoning capabilities. This paper aims to bridge this gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Building on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, comprising domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV and spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on EHR-Bench and achieving a 10\% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.