Proactive approaches to security, such as adversary emulation, rely on information about threat actors and their techniques, known as Cyber Threat Intelligence (CTI). However, most CTI still comes in unstructured form (i.e., natural language), such as incident reports and leaked documents. To support proactive security efforts, we present an experimental study on the automatic classification of unstructured CTI into attack techniques using machine learning (ML). We contribute two new datasets for CTI analysis and evaluate several ML models, including both traditional and deep learning-based ones. We present several lessons learned about how well ML performs at this task, which classifiers perform best and under which conditions, what the main causes of classification errors are, and the challenges ahead for CTI analysis.