Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini attaining the highest average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.