Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte $n$-grams, among many others. In this research, we consider opcode features. We implement hybrid machine learning techniques in which feature vectors are engineered from opcode sequences in two ways: by training hidden Markov models, a technique that we refer to as HMM2Vec, and by training Word2Vec embeddings. The resulting HMM2Vec and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), $k$-nearest neighbor ($k$-NN), random forest (RF), and convolutional neural network (CNN) classifiers. We conduct substantial experiments over a variety of malware families. Our experiments extend well beyond any previous work in this field.