Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual clinical abstraction that does not scale. In this study, we first established a rigorous patient-level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. This labor-intensive process yielded a high-fidelity ground truth and provided the blueprint for SNOW (Scalable Note-to-Outcome Workflow), a transparent multi-agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. On 5-year cancer recurrence prediction, SNOW (AUC-ROC 0.767) achieved performance comparable to manual CFG (0.762) and outperformed structured baselines, clinician-guided LLM extraction, and six representational feature generation (RFG) approaches. Once configured, SNOW produced the full patient-level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort approximately 48-fold versus manual CFG. To test scalability where manual CFG is infeasible, we deployed SNOW on an external heart failure with preserved ejection fraction (HFpEF) cohort from MIMIC-IV (n=2,084); without task-specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30-day (SNOW: 0.851) and 1-year (SNOW: 0.763) mortality prediction. These results demonstrate that a modular LLM agent-based system can scale expert-level feature generation from clinical notes, enabling interpretable use of unstructured EHR text in outcome prediction while generalizing across two distinct conditions and cohorts.
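For readers less familiar with the AUC-ROC metric reported above, it can be read as the probability that a randomly chosen patient who experienced the outcome receives a higher predicted risk than a randomly chosen patient who did not. The following is a minimal sketch of that pairwise definition, not the evaluation code used in this study. The function name `auc_roc` and the toy labels and scores are hypothetical and chosen only for illustration, and counting ties as half a win is a common convention assumed here.

```python
from itertools import product


def auc_roc(labels, scores):
    """Pairwise AUC-ROC: fraction of (positive, negative) pairs in which
    the positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = outcome occurred.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
print(f"toy AUC-ROC = {toy_auc:.3f}")
```

The pairwise form is quadratic in cohort size, so it is suited to small illustrations; rank-based implementations in standard statistics libraries compute the same quantity more efficiently.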