New Datasets for Dynamic Malware Classification

Nowadays, malware and malware incidents are increasing daily, despite the variety of antivirus systems and malware detection and classification methodologies. Many static, dynamic, and hybrid techniques have been presented to detect malware and classify it into malware families. Dynamic and hybrid malware classification methods have an advantage over static methods in being highly efficient. Because it is more difficult for malware to mask its behavior during execution than to disguise the underlying code that static classification inspects, machine learning techniques have been the main focus of security experts for detecting malware and determining its family dynamically. The rapid increase of malware also creates a need for recent, updated datasets of malicious software. We introduce two new, updated datasets in this work: one with 9,795 samples obtained and compiled from VirusSamples and one with 14,616 samples from VirusShare. This paper also analyzes the multi-class malware classification performance of the balanced and imbalanced versions of these two datasets using histogram-based gradient boosting, Random Forest, Support Vector Machine, and XGBoost models with API call-based dynamic malware classification. Results show that Support Vector Machine achieves the highest score, 94%, on the imbalanced VirusSample dataset, whereas the same model reaches 91% accuracy on the balanced VirusSample dataset. XGBoost, one of the most common gradient boosting-based models, achieves the highest scores, 90% and 80%, on the two versions of the VirusShare dataset. This paper also presents baseline results for the VirusShare and VirusSample datasets using the four most widely known machine learning techniques in the dynamic malware classification literature. We believe that these two datasets and baseline results enable researchers in this field to test and validate their methods and approaches.
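As a rough illustration of API call-based features and of a balanced versus imbalanced dataset, the sketch below turns toy API call traces into count vectors and balances the families by random undersampling. The abstract does not say how the features were encoded or how the balanced versions were built, so the count-vector encoding, the undersampling step, the helper names `api_call_counts` and `undersample`, and the toy traces and family labels are all assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def api_call_counts(trace, vocabulary):
    """Map one API call trace (a list of call names) to a count vector."""
    counts = Counter(trace)
    return [counts[name] for name in vocabulary]


def undersample(samples, labels, seed=0):
    """Randomly undersample every family to the size of the smallest one.

    This is one common balancing strategy, assumed here for illustration;
    it is not necessarily the procedure used to build the datasets.
    """
    by_family = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_family[label].append(sample)
    target = min(len(items) for items in by_family.values())
    rng = random.Random(seed)
    balanced_samples, balanced_labels = [], []
    for label in sorted(by_family):
        for sample in rng.sample(by_family[label], target):
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels


# Hypothetical traces and family labels, not taken from either dataset.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "WriteFile"],
    ["RegOpenKeyExW", "RegSetValueExW"],
    ["CreateFileW", "CloseHandle"],
]
families = ["Trojan", "Trojan", "Adware", "Trojan"]

# Fixed, sorted vocabulary so every trace maps to a vector of the same length.
vocabulary = sorted({call for trace in traces for call in trace})
features = [api_call_counts(trace, vocabulary) for trace in traces]

# Imbalanced version: features/families as-is. Balanced version: undersampled.
balanced_features, balanced_families = undersample(features, families)
```

Vectors like these could then be passed to any of the four model types named in the abstract; the classifiers are left out here to keep the sketch dependency-free.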

