Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms, such as $k$-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART), have high error rates. However, those error rates can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted that uses recent data to predict multiple health indicators of open-source projects.
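As a rough illustration of why tuning a learner's hyperparameters can lower its error, the sketch below tunes the neighbor count $k$ of a small KNN regressor by grid search on a validation split. It is not the study's method: the monthly commit counts are synthetic, grid search merely stands in for whichever optimizer is used, and the names `knn_predict`, `mean_abs_error`, and `tune_k` are hypothetical.

```python
import random


def knn_predict(train_x, train_y, query, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_abs_error(train_x, train_y, test_x, test_y, k):
    """Mean absolute error of the KNN regressor on a held-out split."""
    errors = [abs(knn_predict(train_x, train_y, x, k) - y)
              for x, y in zip(test_x, test_y)]
    return sum(errors) / len(errors)


def tune_k(train_x, train_y, valid_x, valid_y, candidates):
    """Grid search: return the (k, error) pair with the lowest validation error."""
    scored = [(mean_abs_error(train_x, train_y, valid_x, valid_y, k), k)
              for k in candidates]
    best_error, best_k = min(scored)
    return best_k, best_error


# Synthetic stand-in for one project's monthly commit counts (assumption:
# a linear trend plus Gaussian noise; real project data will look different).
rng = random.Random(0)
months = list(range(60))
commits = [20 + 0.5 * m + rng.gauss(0, 5) for m in months]

# Even months train the model, odd months validate the choice of k.
train_x, train_y = months[0::2], commits[0::2]
valid_x, valid_y = months[1::2], commits[1::2]

candidates = [1, 2, 3, 5, 8, 13]
default_error = mean_abs_error(train_x, train_y, valid_x, valid_y, 1)
best_k, best_error = tune_k(train_x, train_y, valid_x, valid_y, candidates)

print(f"untuned (k=1) error: {default_error:.2f}")
print(f"tuned (k={best_k}) error: {best_error:.2f}")
```

Because the untuned setting `k=1` is among the candidates, the tuned error can never exceed the untuned one on the validation split. How much it improves depends on the data, and a fair comparison would report error on a separate test split.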