Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of cancer patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of gene expression (GE) data, many projects use dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional GE data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (~document) as a mixture over cancer-topics, where each cancer-topic is a mixture over GE values (~words). This required some extensions to the standard Latent Dirichlet Allocation (LDA) model, e.g., to accommodate the real-valued expression values, leading to our novel "discretized" Latent Dirichlet Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which describes breast cancer patients using r=49,576 GE values from microarrays. Our results show that our approach provides survival estimates that are more accurate than those of standard models, in terms of the standard Concordance measure. We then validate this approach by running it on the Pan-kidney (KIPAN) dataset, over r=15,529 GE values, here using the mRNAseq modality, and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and effectiveness of this approach.
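To make the Concordance measure mentioned above concrete, the following is a minimal sketch of a Harrell-style concordance index for right-censored survival data. It is an illustration under stated assumptions, not the evaluation code used in this work: the function name `concordance_index`, the toy data, and the choice to skip pairs with tied observed times are all assumptions made here for brevity. A pair of patients is treated as comparable only when the patient with the shorter observed time actually experienced the event; the index is the fraction of comparable pairs whose predicted survival times are ordered the same way as the observed times.

```python
from itertools import combinations


def concordance_index(times, events, predicted):
    """Harrell-style concordance for predicted survival times.

    times     -- observed time for each patient (event or censoring time)
    events    -- 1/True if the event was observed, 0/False if censored
    predicted -- predicted survival time (larger means longer survival)

    Assumption for this sketch: pairs with tied observed times are skipped.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore tied observed times
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if predicted[a] < predicted[b]:
            concordant += 1.0  # prediction agrees with the observed order
        elif predicted[a] == predicted[b]:
            concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the second one censored at time 10.
toy_times = [5, 10, 15, 20]
toy_events = [1, 0, 1, 1]
toy_predicted = [6, 9, 30, 14]
toy_score = concordance_index(toy_times, toy_events, toy_predicted)
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and 0.0 means every comparable pair is reversed. In the toy data, only the last two patients are ordered incorrectly, so three of the four comparable pairs are concordant.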