CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain

The field of cybersecurity is evolving fast. Experts need to stay informed about past, current and, in the best case, upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the textual domain, pre-trained language models like BERT have been shown to be helpful by providing a good baseline for further fine-tuning. However, because cybersecurity text relies on domain knowledge and many technical terms, general language models might miss the gist of textual information, doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model performs best in specific application scenarios. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The dataset and trained model are made publicly available.
