Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, order examinations, and refine the differential diagnosis based on patients' responses. This dynamic clinical-reasoning process is poorly captured by existing LLM benchmarks, which focus on static question answering. To address this gap, recent work has explored dynamic medical evaluation frameworks built on interactive clinical dialogues. Although effective, these methods often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing the clinical reasoning of LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and orchestrates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy, incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical-reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
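To make the interaction protocol concrete, below is a minimal Python sketch of a ClinDEF-style diagnostic dialogue loop. All names here (`PatientCase`, `patient_agent`, `run_dialogue`, `evaluate`) are hypothetical illustrations, not the actual ClinDEF API; the knowledge-graph-grounded case generation and the rubric-based quality scoring are stubbed out, and a real setup would replace the scripted `ask_doctor` callable with a wrapper around an actual LLM.

```python
from dataclasses import dataclass

@dataclass
class PatientCase:
    """A synthetic case; ClinDEF samples these from a disease knowledge graph."""
    disease: str          # ground-truth diagnosis
    chief_complaint: str  # opening statement volunteered by the patient
    findings: dict        # finding keyword -> answer revealed only on request

def patient_agent(case: PatientCase, question: str) -> str:
    """Automated patient: reveals a finding only when the doctor asks about it."""
    for keyword, answer in case.findings.items():
        if keyword in question.lower():
            return answer
    return "I'm not sure; could you ask about something more specific?"

def run_dialogue(case: PatientCase, ask_doctor, max_turns: int = 10):
    """Multi-turn loop: the doctor asks questions until committing to a diagnosis."""
    transcript = [("patient", case.chief_complaint)]
    for _ in range(max_turns):
        kind, text = ask_doctor(transcript)  # ("ask", question) or ("diagnose", dx)
        if kind == "diagnose":
            transcript.append(("diagnosis", text))
            break
        transcript.append(("doctor", text))
        transcript.append(("patient", patient_agent(case, text)))
    return transcript

def evaluate(transcript, case: PatientCase) -> dict:
    """Toy multi-level evaluation: diagnostic accuracy plus a turn-efficiency
    count. ClinDEF additionally applies rubric-based quality scoring, omitted
    here for brevity."""
    dx = next((t for role, t in transcript if role == "diagnosis"), "")
    return {
        "accuracy": float(case.disease.lower() in dx.lower()),
        "questions_asked": sum(role == "doctor" for role, _ in transcript),
    }

# Usage with a scripted stand-in for the LLM doctor:
case = PatientCase(
    disease="appendicitis",
    chief_complaint="My stomach hurts badly.",
    findings={"pain": "Sharp pain in my lower right abdomen.",
              "fever": "Yes, 38.2 C since yesterday."},
)
script = iter([("ask", "Where is the pain located?"),
               ("ask", "Do you have a fever?"),
               ("diagnose", "Acute appendicitis")])
print(evaluate(run_dialogue(case, lambda transcript: next(script)), case))
# -> {'accuracy': 1.0, 'questions_asked': 2}
```

Keeping the case hidden behind the patient agent, rather than handing the full record to the doctor model, is what lets the evaluation separate information-gathering efficiency (how many targeted questions were needed) from final diagnostic accuracy.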