BioBERT:生物医学文字采矿的预先培训生物医学语言代表模式 (BioBERT: a pre-trained biomedical language representation model for biomedical text mining)

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in machine learning, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, as deep learning models require a large amount of training data, applying deep learning to biomedical text mining is often unsuccessful due to the lack of training data in biomedical fields. Recent researches on training contextualized language representation models on text corpora shed light on the possibility of leveraging a large number of unannotated biomedical text corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge from a large amount of biomedical texts to biomedical text mining models with minimal task-specific architecture modifications. While BERT shows competitive performances with previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.51% absolute improvement), biomedical relation extraction (3.49% absolute improvement), and biomedical question answering (9.61% absolute improvement). We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

翻译：随着生物医学文献的迅速增加,生物医学文本的开采变得越来越重要。随着机器学习的进展,生物医学文献中的宝贵信息在研究人员中越来越受欢迎,深层次学习推动了有效的生物医学文本采矿模式的发展。然而,由于深层次学习模式需要大量的培训数据,因此将深层学习应用于生物医学文本采矿往往由于缺乏生物医学领域的培训数据而不能成功。最近对教科书中背景化语言代表模式的培训研究揭示了利用大量未经注解的生物医学文本公司的可能性。我们引入了生物生物生物伦理(生物医学文献的转化者对生物医学文献的描述),这是一种特定领域的语言代表模式,在大规模生物医学材料开采方面事先经过了培训。根据生物伦理研究小组的结构,生物生物生物伦理研究小组有效地将大量生物医学文本的知识转移给生物医学文本采矿模式,而具体任务结构的修改极少。生物伦理研究小组展示了先前的州级生物医学模型的竞争性表现,生物伦理资源资源评估在以下三个具有代表性的生物医学文献中大大优于这些:生物医学文献:生物医学生物医学-生物医学改革变异(生物医学-生物医学-生物医学的绝对比例),在生物医学-生物医学研究前的改进(生物医学-生物医学-生物医学-生物物理-生物医学-生物学-生物物理-生物学-生物学-生物学-生物学-生物物理-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物物理-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物物理-生物学-生物物理-生物物理-生物学-生物学-生物学-生物物理-生物学-生物物理-生物学-生物研究-生物物理-生物学-生物学-生物物理-生物物理-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物学-生物物理-生物物理-生物学-生物物理-生物物理学-生物物理学-生物学-生物物理-生物物理-生物物理