Before researchers rush to reason across all available data or try complex methods, it is prudent to first check for simpler alternatives. Specifically, if the historical data is most informative in some small region, then a model learned from that region may suffice for the rest of the project. To support this claim, we offer a case study with 240 GitHub projects, where we find that the information in those projects "clumped" towards the earliest parts of the project. A defect prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly and very early in the software project life cycle. Moreover, using this method, we have shown that a simple model (with just two features) generalizes to hundreds of software projects. Based on this experience, we suspect that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later life cycle data needs to be revisited, since its conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are online at https://github.com/snaraya7/simplifying-software-analytics
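As an illustration of the early life cycle idea, the following minimal sketch separates a project's first 150 commits (the candidate training region) from all later commits (the region the model would be applied to). It is not taken from the replication package: the `Commit` record, its fields, and the `split_early` helper are hypothetical names introduced here, and the sketch assumes commits carry a timestamp that orders them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Commit:
    # Hypothetical record; field names are assumptions for illustration.
    sha: str
    timestamp: int        # e.g. seconds since the epoch
    is_defective: bool    # label assumed to be available from history


def split_early(commits: List[Commit],
                n_early: int = 150) -> Tuple[List[Commit], List[Commit]]:
    """Order commits chronologically and split off the first n_early."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    return ordered[:n_early], ordered[n_early:]


if __name__ == "__main__":
    # Synthetic history of 400 commits, deliberately supplied out of order.
    history = [Commit(sha=str(i), timestamp=1000 - i, is_defective=(i % 3 == 0))
               for i in range(400)]
    early, later = split_early(history)
    print(len(early), len(later))
```

A model would then be fitted on `early` only and evaluated on `later`; which learner and which two features to use is left open here, since the abstract does not name them.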