Cyber security threats have grown significantly in both volume and sophistication over the past decade, which makes malware detection very difficult without considerable automation. In this paper, we propose a novel approach that extends our recently proposed artificial neural network (ANN) based model with feature selection using principal component analysis (PCA) for malware detection. We demonstrate the effectiveness of the approach by applying it to PDF malware detection, and examine a varying number of principal components in a comparative study. Our evaluation shows that the model with PCA significantly reduces feature redundancy and learning time with minimal information loss, as confirmed by both training and testing results on around 105,000 real-world PDF documents. Of the evaluated PCA models, the one with 32 principal components exhibits training accuracy very similar to that of the model using the 48 original features, while giving around 33% dimensionality reduction and 22% less learning time. The testing results further confirm its effectiveness: the model achieves a 93.17% true positive rate (TPR) while maintaining the same low false positive rate (FPR) of 0.08% as when no feature selection is applied. It significantly outperforms all seven evaluated well-known commercial antivirus (AV) scanners, the best of which has a TPR of only 84.53%.
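For readers less familiar with the reported metrics, the following minimal sketch shows how TPR, FPR, and the dimensionality reduction figure are conventionally computed. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not the paper's data. Only the feature counts 48 and 32 come from the text above.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion-matrix counts.

    TPR = TP / (TP + FN): fraction of malicious files that are detected.
    FPR = FP / (FP + TN): fraction of benign files that are wrongly flagged.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


def dimensionality_reduction(original, reduced):
    """Return the fraction of feature dimensions removed."""
    return (original - reduced) / original


# Hypothetical counts for illustration only (not taken from the paper).
tpr, fpr = rates(tp=930, fn=70, fp=1, tn=999)
print(f"TPR = {tpr:.2%}, FPR = {fpr:.2%}")

# Feature counts from the abstract: 48 original features, 32 principal components.
reduction = dimensionality_reduction(48, 32)
print(f"Dimensionality reduction = {reduction:.1%}")  # prints 33.3%
```

The last line reproduces the "around 33%" figure quoted above, since (48 - 32) / 48 = 1/3.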