As malware grows in variety and quantity, there is an urgent need to speed up its diagnosis and analysis. Extracting malware family-related tokens from the AV (Anti-Virus) labels provided by online anti-virus engines paves the way for pre-diagnosing malware. Automatically extracting the vital information from AV labels would greatly enhance the detection ability of security enterprises and strengthen the research ability of security analysts. Recent works such as AVCLASS and AVCLASS2 extract the attributes of malware from AV labels and establish a taxonomy based on expert knowledge. However, because malicious behaviors are complex and evolve unpredictably, such a system needs three abilities to meet the challenge: preserving vital semantics, being expansible, and being free from expert knowledge. In this work, we present AVMiner, an expansible malware tagging system that mines the most vital tokens from AV labels. AVMiner applies natural language processing techniques and clustering methods to generate, without expert knowledge, a sequence of tokens ranked by importance. AVMiner can also update itself as new samples arrive. Finally, we evaluate AVMiner on over 8,000 samples from well-known datasets with manually labeled ground truth, and it outperforms previous works.
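To make the idea of ranking tokens from AV labels concrete, the following is a minimal sketch, not AVMiner's actual pipeline. It swaps the NLP and clustering steps for a plain cross-engine frequency count: each label is split into tokens, and a token is ranked higher the more engines mention it. The engine names, the sample labels, and the helpers `tokenize` and `rank_tokens` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(label: str) -> list[str]:
    """Lowercase an AV label and split it on non-alphanumeric characters.

    Very short tokens and purely numeric tokens are dropped; this is a
    simple heuristic assumed for the sketch, not a rule from the paper.
    """
    parts = re.split(r"[^0-9a-zA-Z]+", label.lower())
    return [t for t in parts if len(t) >= 3 and not t.isdigit()]


def rank_tokens(labels_by_engine: dict[str, str]) -> list[tuple[str, int]]:
    """Rank tokens by the number of engines whose label contains them."""
    counts = Counter()
    for label in labels_by_engine.values():
        # A set ensures each token is counted at most once per engine.
        counts.update(set(tokenize(label)))
    # Sort by descending engine count, then alphabetically for stable ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical labels for one sample, written in a typical AV label style.
sample = {
    "EngineA": "Trojan.Win32.Zbot.abc",
    "EngineB": "Win32/Zbot.gen!A",
    "EngineC": "Trojan-Spy.Zbot",
    "EngineD": "Generic.Malware.12345",
}

ranked = rank_tokens(sample)
for token, engines in ranked:
    print(f"{token}: {engines}")
```

In this toy input the family-like token `zbot` ranks first because three of the four engines agree on it, while generic tokens such as `trojan` and `win32` follow. A frequency count alone cannot separate family names from generic terms in general, which is the gap the abstract's NLP and clustering methods are meant to address.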