Many AI researchers publish the code, data, and other resources that accompany their papers in GitHub repositories. In this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). Hence, we perform an empirical study of academic AI repositories to highlight, for AI researchers, the good software engineering practices of popular academic AI repositories. We collect 1,149 academic AI repositories, label the 20% of repositories with the most stars as popular and the bottom 70% as unpopular, and set aside the remaining 10% as a gap between the two groups. We propose 21 features to characterize the software engineering practices of academic AI repositories. Our experimental results show that popular and unpopular academic AI repositories differ statistically significantly in 11 of the studied features, indicating that the two groups follow significantly different software engineering practices. Furthermore, we find that the number of links to other GitHub repositories in the README file, the number of images in the README file, and the inclusion of a license are the most important features for differentiating the two groups of academic AI repositories. Our dataset and code are publicly available.
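The labeling scheme described above (top 20% by stars as popular, bottom 70% as unpopular, the middle 10% as a gap) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `label_by_stars`, the integer-percentage parameters, the floor-based group sizes, and the tie-breaking by input order are all assumptions made for illustration.

```python
from typing import Dict


def label_by_stars(
    stars: Dict[str, int],
    top_pct: int = 20,
    bottom_pct: int = 70,
) -> Dict[str, str]:
    """Label each repository as 'popular', 'unpopular', or 'gap' by star rank.

    Hypothetical helper: group sizes are computed with floor division, and
    repositories with equal star counts keep their input order (assumption).
    """
    if top_pct < 0 or bottom_pct < 0 or top_pct + bottom_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")

    # Rank repositories from most to fewest stars (sorted() is stable).
    ranked = sorted(stars, key=lambda name: stars[name], reverse=True)
    n = len(ranked)
    n_top = n * top_pct // 100        # size of the popular group
    n_bottom = n * bottom_pct // 100  # size of the unpopular group

    labels: Dict[str, str] = {}
    for rank, name in enumerate(ranked):
        if rank < n_top:
            labels[name] = "popular"
        elif rank >= n - n_bottom:
            labels[name] = "unpopular"
        else:
            # Middle band left out of both groups to separate them.
            labels[name] = "gap"
    return labels
```

Calling `label_by_stars` on a mapping from repository name to star count returns one label per repository. Repositories labeled `gap` would be excluded from any comparison between the two groups.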