Malicious software is categorized into families based on static and dynamic characteristics, infection methods, and the nature of the threat. Visually exploring malware instances and families in a low-dimensional space gives a first overview of the dependencies and relationships among these instances, and helps in detecting groups and isolating outliers. Visual exploration of different feature sets is also useful for assessing how well each set provides a valid abstract representation, which can later be used in classification and clustering algorithms to achieve high accuracy. In this paper, we investigate t-SNE, one of the leading dimensionality reduction techniques, to reduce the malware representation from a high-dimensional space of thousands of features to a low-dimensional space. We experiment with different feature sets and depict malware clusters in 2-D. Surprisingly, t-SNE not only produces clear 2-D drawings but also, in our experiments, dramatically increases the generalization power of SVM classifiers. In particular, cross-validation accuracy is much higher with the 2-D embedded representation of samples than with the original high-dimensional representation.
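The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experimental setup of the paper: it uses scikit-learn, synthetic Gaussian blobs in place of real malware feature vectors, and the hypothetical helper names `make_toy_data` and `compare_representations`. Because scikit-learn's `TSNE` cannot project unseen samples, the embedding is computed once on the full data set (without labels) before cross-validation, so the 2-D score is transductive.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_toy_data(n_samples=300, n_features=1000, n_classes=4, seed=0):
    """Synthetic stand-in for malware feature vectors (assumption, not real data)."""
    X, y = make_blobs(
        n_samples=n_samples,
        n_features=n_features,
        centers=n_classes,
        random_state=seed,
    )
    return X, y


def compare_representations(X, y, seed=0, folds=5):
    """Return the 2-D t-SNE embedding and the mean cross-validation accuracy
    of an RBF SVM on the original features and on the embedding."""
    # Embed all samples into 2-D; labels are not used at this step.
    embedding = TSNE(
        n_components=2,
        perplexity=30.0,
        init="pca",
        random_state=seed,
    ).fit_transform(X)

    # Same classifier for both representations, with feature scaling.
    def mean_accuracy(features):
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        return cross_val_score(model, features, y, cv=folds).mean()

    return embedding, mean_accuracy(X), mean_accuracy(embedding)


if __name__ == "__main__":
    X, y = make_toy_data()
    embedding, acc_high, acc_low = compare_representations(X, y)
    print(f"high-dimensional accuracy: {acc_high:.3f}")
    print(f"2-D t-SNE accuracy:        {acc_low:.3f}")
```

On well-separated synthetic blobs both scores will typically be high, so this sketch shows only the mechanics of the comparison; any accuracy gap of the kind reported above would have to be measured on real malware features.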