Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations that are aligned with the predictions. Current approaches suffer from exposure bias, which leads to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, the model is trained on rationale generation; in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually shift the conditioning from gold labels to the model's own predictions. We evaluate Reason2Decide on three medical datasets: a proprietary triage dataset and two public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is robust to the rationale source, performing well with LLM-generated, nurse-authored, and nurse-post-processed rationales. Even when Stage-1 uses only LLM-generated rationales, Reason2Decide outperforms other fine-tuning variants, indicating that LLM-generated rationales are suitable for pretraining and reduce reliance on human annotation. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
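To make the scheduled-sampling step concrete, the sketch below shows one way the Stage-2 conditioning could shift from gold labels to model predictions. It is a minimal illustration under assumed choices: the linear decay schedule, the string labels, and the function names (`gold_label_prob`, `pick_conditioning_label`) are hypothetical and not taken from the paper's implementation.

```python
import random

def gold_label_prob(step: int, total_steps: int) -> float:
    # Probability of conditioning the rationale generator on the gold label.
    # Assumed linear decay; the paper's actual schedule (linear, exponential,
    # inverse-sigmoid, ...) is not specified in this abstract.
    return max(0.0, 1.0 - step / total_steps)

def pick_conditioning_label(gold: str, model_pred: str,
                            step: int, total_steps: int) -> str:
    # Scheduled sampling: early in Stage-2, rationale generation is
    # conditioned on the gold label; as training progresses, it is
    # increasingly conditioned on the model's own prediction, reducing
    # exposure bias at inference time (when no gold label is available).
    if random.random() < gold_label_prob(step, total_steps):
        return gold
    return model_pred

# Toy usage: over training, conditioning drifts from "urgent" (gold)
# toward "non-urgent" (a hypothetical model prediction).
for step in range(5):
    print(step, pick_conditioning_label("urgent", "non-urgent", step, 5))
```

The design intent, as described in the abstract, is that by the end of Stage-2 the model generates rationales conditioned on the same kind of input it will see at inference time, its own predictions, rather than on gold labels it will never have.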