Feature extraction and feature selection are the first tasks in pre-processing input logs for machine-learning-based detection of cyber security threats and attacks. When the data are heterogeneous and come from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach to feature extraction and feature selection for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark using its Python API, PySpark.