Defectors: A Large, Diverse Python Dataset for Defect Prediction


Defect prediction · Dataset · Python · ML · Large-scale data

April 11, 2023


Parvez Mahbub, Ohiduzzaman Shuvo, Mohammad Masudur Rahman

Defect prediction is a popular research topic in which machine learning (ML) and deep learning (DL) have found numerous applications. However, ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of $\approx$ 213K source code files ($\approx$ 93K defective and $\approx$ 120K defect-free) that span 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. This scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models that require large and diverse datasets. We also foresee several application areas for our dataset, including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984
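The class split quoted in the abstract can be turned into a quick balance check. The sketch below uses only the approximate counts stated above (93K defective, 120K defect-free); the function name `class_balance` and the inverse-frequency weighting are illustrative assumptions, a common heuristic for imbalanced classes rather than anything the authors prescribe.

```python
# Approximate file counts quoted in the abstract; exact figures may differ.
DEFECTIVE = 93_000
DEFECT_FREE = 120_000


def class_balance(defective: int, defect_free: int) -> dict:
    """Return class shares and inverse-frequency weights for two classes.

    Hypothetical helper for illustration; not part of the Defectors release.
    """
    total = defective + defect_free
    n_classes = 2
    return {
        "total": total,
        "defective_ratio": defective / total,
        "defect_free_ratio": defect_free / total,
        # Inverse-frequency weight: total / (n_classes * class_count).
        "weight_defective": total / (n_classes * defective),
        "weight_defect_free": total / (n_classes * defect_free),
    }


balance = class_balance(DEFECTIVE, DEFECT_FREE)
print(f"total files:        {balance['total']}")
print(f"defective share:    {balance['defective_ratio']:.1%}")
print(f"defect-free share:  {balance['defect_free_ratio']:.1%}")
print(f"weight (defective): {balance['weight_defective']:.3f}")
print(f"weight (clean):     {balance['weight_defect_free']:.3f}")
```

With these rounded counts, roughly 44% of the files are defective, so the two classes are fairly close to balanced and the resulting weights stay near 1.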

