Malware Classification with Word Embedding Features

Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte $n$-grams, among many others. In this research, we consider opcode features. We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models -- a technique that we refer to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The resulting HMM2Vec and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), $k$-nearest neighbor ($k$-NN), random forest (RF), and convolutional neural network (CNN) classifiers. We conduct substantial experiments over a variety of malware families. Our experiments extend well beyond any previous work in this field.
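The pipeline described in the abstract (opcode sequences, then embedding vectors, then a classifier) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. To stay dependency-free it swaps in simple co-occurrence count vectors for trained Word2Vec embeddings, averages them per sample, and classifies with a cosine-similarity $k$-NN. The opcode sequences, the family labels `famA` and `famB`, and all function names are hypothetical.

```python
import math
from collections import Counter

def cooccurrence_embeddings(sequences, window=2):
    """Map each opcode to a vector of co-occurrence counts within +/- window.
    Stand-in for Word2Vec: no training, just counting neighbours."""
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0.0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[op][index[seq[j]]] += 1.0
    return vectors

def sample_features(seq, vectors):
    """Average the embedding vectors of the known opcodes in one sample."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    known = [op for op in seq if op in vectors]
    for op in known:
        for k, value in enumerate(vectors[op]):
            total[k] += value
    return [t / max(len(known), 1) for t in total]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(x, train_X, train_y, k=1):
    """Majority vote among the k training samples most similar to x."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: cosine(x, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two made-up families with distinct opcode habits.
train_seqs = [
    ["mov", "push", "call", "mov", "push", "call"],
    ["push", "call", "mov", "push", "call", "mov"],
    ["xor", "jmp", "xor", "add", "jmp", "xor"],
    ["add", "xor", "jmp", "add", "xor", "jmp"],
]
train_labels = ["famA", "famA", "famB", "famB"]

vectors = cooccurrence_embeddings(train_seqs)
train_X = [sample_features(seq, vectors) for seq in train_seqs]

query = ["mov", "push", "call", "push", "call"]
prediction = knn_predict(sample_features(query, vectors), train_X, train_labels)
print(prediction)
```

In a realistic setting the co-occurrence step would be replaced by a trained embedding model, and the hand-written $k$-NN by a library classifier such as an SVM or random forest. The shape of the pipeline stays the same.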


Word vector representation

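A distributed representation expresses language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relationships, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large-scale unlabeled corpora. The quality of the resulting embeddings also depends heavily on the choice of context window size. Typically, embeddings learned with a large context window better reflect topical information, while embeddings learned with a small context window better reflect a word's function and its contextual semantics.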
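The two ideas above, the context window and the analogy relation, can be made concrete with a small sketch. The first function lists the (target, context) pairs that a symmetric window would produce, showing how a larger window pulls in more distant words. The second answers an analogy query by vector arithmetic. The 2-D vectors are hand-made for illustration and are not learned embeddings, and all names are hypothetical.

```python
import math

def context_pairs(tokens, window):
    """Return (target, context) pairs for a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hand-made toy vectors (assumption): axis 0 ~ royalty, axis 1 ~ gender.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

sentence = ["the", "cat", "sat", "down"]
narrow = context_pairs(sentence, window=1)  # only adjacent words
wide = context_pairs(sentence, window=3)    # every other word in the sentence
answer = analogy("man", "woman", "king", toy_vectors)
print(len(narrow), len(wide), answer)
```

With real embeddings the analogy holds only approximately, which is why the query returns the nearest neighbor by cosine similarity rather than testing for exact equality.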
