Lung cancer is the leading cause of cancer-related mortality in adults worldwide. Screening high-risk individuals with annual low-dose CT (LDCT) can support earlier detection and reduce deaths, but widespread implementation may strain the already limited radiology workforce. AI models have shown potential in estimating lung cancer risk from LDCT scans. However, high-risk populations for lung cancer are diverse, and these models' performance across demographic groups remains an open question. In this study, we drew on the considerations of confounding factors and ethically significant biases outlined in the JustEFAB framework to evaluate potential performance disparities and fairness in two deep learning risk estimation models for lung cancer screening: the Sybil lung cancer risk model and the Venkadesh21 nodule risk estimator. We also examined disparities in the PanCan2b logistic regression model recommended in the British Thoracic Society nodule management guideline. Both deep learning models were trained on data from the US-based National Lung Screening Trial (NLST) and were assessed on a held-out NLST validation set. We evaluated AUROC, sensitivity, and specificity across demographic subgroups, and explored potential confounding from clinical risk factors. We observed a statistically significant AUROC difference in Sybil's performance between women (0.88, 95% CI: 0.86, 0.90) and men (0.81, 95% CI: 0.78, 0.84; p < .001). At 90% specificity, Venkadesh21 showed lower sensitivity for Black participants (0.39, 95% CI: 0.23, 0.59) than for White participants (0.69, 95% CI: 0.65, 0.73). These differences were not explained by available clinical confounders and thus may be classified as unfair biases according to JustEFAB. Our findings highlight the importance of improving and monitoring model performance across underrepresented subgroups in lung cancer screening, and the need for further research on algorithmic fairness.
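To make the subgroup evaluation concrete, the sketch below shows one way to compute AUROC with a percentile bootstrap 95% CI and sensitivity at 90% specificity for each demographic subgroup. This is a minimal illustration under assumed inputs (arrays of risk scores, cancer labels, and a group attribute); it is not the paper's exact statistical procedure, and the function names and bootstrap settings here are hypothetical choices.

```python
# Minimal sketch of per-subgroup evaluation, assuming per-participant arrays of
# model risk scores, binary cancer labels, and a demographic attribute.
# Function names, bootstrap settings, and the 90% specificity target are
# illustrative assumptions, not the authors' exact protocol.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def bootstrap_auroc(y_true, y_score, n_boot=2000, seed=0):
    """AUROC point estimate with a percentile bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y_true, y_score)
    stats = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return point, lo, hi


def sensitivity_at_specificity(y_true, y_score, target_spec=0.90):
    """Sensitivity at the operating point meeting the target specificity."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = fpr <= (1 - target_spec)  # specificity = 1 - FPR
    return tpr[ok].max() if ok.any() else 0.0


def evaluate_by_group(y_true, y_score, group):
    """Report AUROC (95% CI) and sensitivity at 90% specificity per subgroup."""
    for g in np.unique(group):
        m = group == g
        auc, lo, hi = bootstrap_auroc(y_true[m], y_score[m])
        sens = sensitivity_at_specificity(y_true[m], y_score[m])
        print(f"{g}: AUROC {auc:.2f} (95% CI {lo:.2f}, {hi:.2f}), "
              f"sens@90% spec {sens:.2f}")
```

For comparing subgroup AUROCs formally (e.g., women vs. men for Sybil), a paired test such as DeLong's or a permutation test over the bootstrap resamples would be needed; the abstract does not specify which was used, so none is shown here.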

