Clustering Malware at Scale: A First Full-Benchmark Study

From arXiv: pre-print of the paper (i.e., the "submitted manuscript" version). Small updates to the tables, figures, and text were made to report the correct results on Ember.

Malware attacks continue to occur with high frequency. Malware experts categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One way to identify groups of malware samples is malware clustering. Despite the efforts of the community, malware clustering that incorporates benign samples remains under-explored. Moreover, despite the availability of larger public benchmark malware datasets, malware clustering studies have avoided fully utilizing these datasets in their experiments, often resorting to small datasets with only a few families. Additionally, the current state-of-the-art solutions for malware clustering remain unclear. In our study, we evaluate malware clustering quality and establish the state of the art on Bodmas and Ember, two large public benchmark malware datasets. Ours is the first study of malware clustering performed on whole malware benchmark datasets. Additionally, we extend the malware clustering task by incorporating benign samples. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. We find that the quality of the created clusters differs between Ember and Bodmas, as well as a private industry dataset. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.
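
The abstract speaks of "clustering quality" without naming the measures used. As a rough illustration only, the sketch below scores a cluster assignment against known family labels using three common external measures: purity, homogeneity, and completeness. The choice of these measures is an assumption, not something the abstract states. The toy labels (`family_a`, `family_b`, `benign`) and the cluster assignment are hypothetical and are not taken from Bodmas or Ember. The benign samples are treated as one more label, which is one simple way to include them in the evaluation.

```python
from collections import Counter
from math import log


def purity(true_labels, cluster_ids):
    """Fraction of samples that carry the majority label of their cluster."""
    per_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(true_labels)


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) in nats, estimated from co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(true_labels, cluster_ids):
    """1.0 when every cluster contains samples of a single label only."""
    h = entropy(true_labels)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h


def completeness(true_labels, cluster_ids):
    """1.0 when all samples of a label end up in the same cluster."""
    h = entropy(cluster_ids)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h


# Hypothetical toy data: two malware families plus benign samples.
families = ["family_a"] * 4 + ["family_b"] * 3 + ["benign"] * 3
# Hypothetical clustering output: one benign sample lands in cluster 1.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 1]

print("purity:      ", purity(families, clusters))  # 9 of 10 samples match their cluster's majority label
print("homogeneity: ", homogeneity(families, clusters))
print("completeness:", completeness(families, clusters))
```

In practice the cluster assignments would come from an algorithm such as K-Means, BIRCH, DBSCAN, or HAC run on feature vectors extracted from the samples. Here they are written by hand to keep the example self-contained.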
