Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on a few dimensions, such as code smells, technical debt, or refactoring activity, which restricts comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools (SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef), the dataset unifies over 700 unique metrics at the method, class, file, and project levels. Covering 63,586 analyzed project releases in total, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics shown to improve Just-In-Time (JIT) defect prediction. SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at an unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling, to support the continuous evolution of software analytics. The dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.17566690).