Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up. We present the first stage-specific, cross-sectional benchmark of deep learning (DL) models for follow-up MRI, using the Burdenko GBM Progression cohort (n = 180). We analyzed the post-radiotherapy (post-RT) follow-up scans independently to test whether architecture performance depends on the time point. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, QC-driven pipeline with patient-level cross-validation. Accuracies were comparable across both stages (~0.70-0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy-efficiency trade-off. Transformer variants delivered competitive AUCs at substantially higher computational cost, whereas lightweight CNNs were efficient but less reliable. Performance was also sensitive to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of separating TP from PsP and the dataset's size imbalance. These results establish a stage-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.
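To illustrate what patient-level cross-validation means in practice, the following is a minimal sketch, not the pipeline used in this study. It assigns every scan from a given patient to the same fold, so no patient contributes to both training and validation data, and it spreads each class across folds in round-robin order. The function name `patient_level_folds`, the patient identifiers, the label encoding (1 = TP, 0 = PsP), and the fold count are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, labels, n_folds=5, seed=0):
    """Return one fold index per scan such that no patient spans two folds.

    patient_ids: one patient identifier per scan.
    labels: one binary label per scan (assumed here: 1 = TP, 0 = PsP);
            each patient is assumed to carry a single label.
    """
    # Keep the first label seen for each patient.
    patient_label = {}
    for pid, y in zip(patient_ids, labels):
        patient_label.setdefault(pid, y)

    # Group patients by class so each class is spread across the folds.
    by_label = defaultdict(list)
    for pid, y in patient_label.items():
        by_label[y].append(pid)

    rng = random.Random(seed)
    fold_of_patient = {}
    offset = 0  # continue the round-robin across classes to keep folds even
    for y in sorted(by_label):
        pids = sorted(by_label[y])
        rng.shuffle(pids)
        for i, pid in enumerate(pids):
            fold_of_patient[pid] = (i + offset) % n_folds
        offset += len(pids)

    # Every scan inherits the fold of its patient.
    return [fold_of_patient[pid] for pid in patient_ids]


# Synthetic example: 12 hypothetical patients with two scans each.
patients = [f"P{i:02d}" for i in range(12) for _ in range(2)]
labels = [1 if int(p[1:]) % 3 else 0 for p in patients]
folds = patient_level_folds(patients, labels, n_folds=4)

# Check that each patient maps to exactly one fold.
folds_per_patient = defaultdict(set)
for pid, fold in zip(patients, folds):
    folds_per_patient[pid].add(fold)
leak_free = all(len(f) == 1 for f in folds_per_patient.values())
```

Splitting at the scan level instead would let images from one patient appear on both sides of a split, which tends to inflate validation metrics. Libraries such as scikit-learn provide grouped splitters for the same purpose.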