Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models with extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection.

Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based PHI de-identification pipeline by fine-tuning it on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports, and by introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories.

Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, matching or exceeding the performance of the previous state-of-the-art model. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model also outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754).

Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy.

Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
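The abstract describes two technical steps: token-level PHI detection with a fine-tuned transformer, followed by "hide-in-plain-sight" substitution of detected PHI with realistic synthetic surrogates. The sketch below is a minimal illustration of how such a workflow could be wired together with the Hugging Face transformers library; it is not the authors' pipeline. The model path, the PHI label set, and the surrogate pools are placeholder assumptions introduced here for illustration only.

```python
# Minimal sketch: token-level PHI detection followed by "hide-in-plain-sight"
# surrogate substitution. Model path, labels, and surrogate pools are
# illustrative placeholders, NOT the pipeline evaluated in the paper.

import random
from transformers import pipeline

# Hypothetical fine-tuned token-classification checkpoint (placeholder path).
ner = pipeline(
    "token-classification",
    model="path/to/deid-transformer",   # placeholder, not a real checkpoint
    aggregation_strategy="simple",      # merge word pieces into entity spans
)

# Illustrative surrogate pools per PHI category; real hide-in-plain-sight
# systems draw from much larger, format-aware surrogate generators.
SURROGATES = {
    "NAME": ["Jordan Lee", "Alex Morgan"],
    "DATE": ["03/14/2019", "11/02/2020"],
    "AGE": ["57", "64"],
    "ID": ["A123456", "Z987654"],
}

def hide_in_plain_sight(report: str) -> str:
    """Replace each detected PHI span with a surrogate of the same category."""
    entities = ner(report)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        surrogate = random.choice(SURROGATES.get(ent["entity_group"], ["[REDACTED]"]))
        report = report[: ent["start"]] + surrogate + report[ent["end"]:]
    return report

if __name__ == "__main__":
    example = "Patient John Smith, 57 years old, seen on 01/05/2021 for chest CT."
    print(hide_in_plain_sight(example))
```

Because surrogates keep the surface form of real PHI, downstream detectors can still be benchmarked on the de-identified text, which is the property the synthetic-PHI stability evaluation in the abstract relies on.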