Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance depends largely on the quality and diversity of the training data. This paper presents the MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples covering a broad range of phishing tactics as well as legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models, Random Forest (RF), XGBoost (XGB), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), across multiple feature configurations. The results demonstrate the dataset's effectiveness, with XGB achieving an F1 score of 98.34%. By integrating a broad set of features from multiple categories, the dataset provides a reusable and consistent resource while addressing common challenges such as class imbalance, generalisability, and reproducibility.
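For readers unfamiliar with the metric reported above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation for a binary phishing/legitimate labelling. The function name `f1_score_binary`, the label encoding (1 = phishing, 0 = legitimate), and the toy labels are illustrative assumptions; they are not taken from the MeAJOR experiments, and the abstract does not state which averaging scheme was used for the reported score.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall.

    Hypothetical helper for illustration only; labels are assumed to be
    1 for phishing and 0 for legitimate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and/or recall is zero
        # or undefined, so F1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for this example (not drawn from the dataset).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
toy_f1 = f1_score_binary(y_true, y_pred)
```

Because F1 ignores true negatives, it is less flattering than plain accuracy when one class dominates, which is one reason it is commonly reported for phishing detection.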