Creating scientific publications is a complex process, typically composed of a number of different activities, such as designing experiments, preparing data, programming software, and writing and editing the manuscript. Information about the contributions of a paper's individual authors is important for assessing their scientific achievements. Some publications in biomedical disciplines describe the authors' roles in a short section written in natural language, typically entitled "Authors' contributions". In this paper, we present an analysis of the roles that commonly appear in these sections, and propose an algorithm for automatically extracting authors' roles from natural language text in scientific publications. In the first part of the study, we used clustering techniques, as well as Open Information Extraction (OpenIE), to semi-automatically discover the most popular roles within a corpus of 2,000 contributions sections obtained from PubMed Central resources. The roles discovered by our approach include: experimenting (1,743 instances, 17% of the entire role set within the corpus), analysis (1,343, 16%), study design (1,132, 13%), interpretation (879, 10%), conceptualization (865, 10%), paper reading (823, 10%), paper writing (724, 8%), paper review (501, 6%), paper drafting (351, 4%), coordination (319, 4%), data collection (76, 1%), paper review (41, 0.5%) and literature review (41, 0.5%). The discovered roles were then used to automatically build a training set for a supervised role extractor based on the Naive Bayes algorithm. According to our evaluation, the proposed role extraction algorithm extracts roles from the text with a precision of 0.71, a recall of 0.49, and an F1 score of 0.58.
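As a quick consistency check on the reported evaluation figures, the short sketch below recomputes F1 as the harmonic mean of precision and recall. The function name `f1_score` is a hypothetical helper introduced only for this illustration; it is not part of the role extractor described above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper).

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.71, 0.49)
print(round(reported_f1, 2))  # prints 0.58
```

The result agrees with the reported F1 of 0.58, and the gap between precision and recall suggests the extractor misses roles more often than it assigns wrong ones.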