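Invited speaker: Han Xu (韩旭) is a PhD student (enrolled 2017) in the Department of Computer Science at Tsinghua University and a member of the Tsinghua University Natural Language Processing group, advised by Associate Professor Liu Zhiyuan (刘知远). Han's main research interests are natural language processing and information extraction. Han has published multiple papers at well-known international conferences in artificial intelligence and natural language processing, including ACL, EMNLP, NAACL, COLING, and AAAI, and maintains several open-source projects on GitHub.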
ORCID is a scientific infrastructure created to solve the problem of author name ambiguity. Over the years, ORCID has also become a useful source for studying the academic activities reported by researchers. Our objective in this research was to use ORCID to analyze one of these activities: the publication of datasets. We illustrate how identifying the datasets shared in researchers' ORCID profiles enables the study of the characteristics of the researchers who produced them. To explore the relevance of ORCID for studying data sharing practices, we obtained all ORCID profiles reporting at least one dataset in their "works" list, together with information about the individual researchers producing the datasets. The retrieved data was organized and analyzed in a SQL database hosted at CWTS. Our results indicate that DataCite is by far the most important source of information about datasets recorded in ORCID. There is also a substantial overlap between DataCite records and those of other repositories (Figshare, Dryad, and Zenodo). The analysis of the distribution of researchers producing datasets shows that the six countries with the most data producers also account for a relatively higher share of all researchers with datasets than of all researchers in ORCID. By discipline, researchers in the areas of Natural Sciences and Medicine and Life Sciences report the largest number of datasets. Finally, we observed that researchers who started their PhD around 2015 published their first dataset earlier than researchers who started their PhD before then. The work concludes with some reflections on the potential of ORCID as a relevant source for research on data sharing practices.