OntoProtein: Protein Pretraining With Gene Ontology Embedding

Self-supervised protein language models have proved effective at learning protein representations. With increasing computational power, current protein language models pre-trained on millions of diverse sequences have scaled from millions to billions of parameters and achieved remarkable improvements. However, these prevailing approaches rarely incorporate knowledge graphs (KGs), which can provide rich, structured knowledge facts for better protein representations. We argue that the biological knowledge in KGs can enhance protein representations. In this work, we propose OntoProtein, the first general framework that incorporates the structure of the Gene Ontology (GO) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, in which every node is described by either gene annotation text or a protein sequence. We propose a novel contrastive learning approach with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embeddings during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models on the TAPE benchmark and yields better performance than baselines in protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.
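
The abstract does not spell out how knowledge-aware negative sampling works, so the following is only a minimal sketch under one stated assumption: that "knowledge-aware" means a corrupted triple replaces the tail with another entity of the same type (for example, the same GO aspect), skipping any replacement that is already a known fact. The function `sample_negatives`, the identifiers, and the relation names are hypothetical and are not taken from the OntoProtein code base.

```python
import random

# Toy, hypothetical identifiers for illustration; not real GO terms or proteins.
ENTITY_TYPE = {
    "GO:mf1": "molecular_function",
    "GO:mf2": "molecular_function",
    "GO:mf3": "molecular_function",
    "GO:bp1": "biological_process",
    "GO:bp2": "biological_process",
}
KNOWN_TRIPLES = {
    ("protein_1", "enables", "GO:mf1"),
    ("protein_1", "enables", "GO:mf2"),
    ("protein_1", "involved_in", "GO:bp1"),
}


def sample_negatives(triple, entity_type, known_triples, k=1, rng=None):
    """Corrupt the tail of (head, relation, tail) with same-type entities.

    Assumption: negatives are restricted to entities that share the tail's
    type, so they are plausible but false, and known facts are never used.
    """
    rng = rng or random.Random(0)
    head, relation, tail = triple
    tail_type = entity_type[tail]
    # Keep only same-type entities that do not form an existing triple.
    candidates = sorted(
        e
        for e, t in entity_type.items()
        if t == tail_type
        and e != tail
        and (head, relation, e) not in known_triples
    )
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(head, relation, e) for e in chosen]


negatives = sample_negatives(
    ("protein_1", "enables", "GO:mf1"), ENTITY_TYPE, KNOWN_TRIPLES, k=2
)
print(negatives)
```

In this toy graph only `GO:mf3` qualifies as a replacement: `GO:mf2` is already a known annotation of `protein_1`, and the biological-process terms have a different type. In a contrastive objective, such negatives would be scored against the positive triple so that true facts rank higher.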
