Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

Statistical inference on the cancer-site specificities of collective ultra-rare whole genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data due to their ultra-high-dimensionality and extreme data sparsity -- e.g., >30 million unique variants are observed in the ~1700 whole-genome tumor dataset considered herein, of which >99% variants are encountered only once. To harness information in these rare variants we have recently proposed the "hidden genome model", a formal multilevel multi-logistic model that mines information in ultra-rare somatic variants to characterize tumor types. The model condenses signals in rare variants through a hierarchical layer leveraging contexts of individual mutations. The model is currently implemented using consistent, scalable point estimation techniques that can handle 10s of millions of variants detected across thousands of tumors. Our recent publications have evidenced its impressive accuracy and attributability at scale. However, principled statistical inference from the model is infeasible due to the volume, correlation, and non-interpretability of the mutation contexts. In this paper we propose a novel framework that leverages topic models from the field of computational linguistics to induce an *interpretable dimension reduction* of the mutation contexts used in the model. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We employ our model on the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.

