Password security plays a crucial role in cybersecurity, yet traditional password strength meters, which rely on static rules such as character-type requirements, often misjudge how guessable a password is. Common patterns (e.g., 'P@ssw0rd1!') satisfy these rules while remaining easy to guess, giving users a false sense of security. To address this, we implement and evaluate a password strength scoring system, comparing four machine learning models on a dataset of over 660,000 real-world passwords: Random Forest (RF), Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and Logistic Regression. Our primary contribution is a novel hybrid feature engineering approach that captures nuanced vulnerabilities missed by standard metrics. We introduce features such as leetspeak-normalized Shannon entropy to assess true randomness, pattern detection for keyboard walks and sequences, and character-level TF-IDF n-grams to identify substrings frequently reused in breached password datasets. The RF model performed best, reaching 99.12% accuracy on a held-out test set. Crucially, the interpretability of the Random Forest model allows for feature importance analysis, providing a clear pathway to developing security tools that offer specific, actionable feedback to users. This study bridges the gap between predictive accuracy and practical usability, resulting in a high-performance scoring system that not only reduces password-based vulnerabilities but also empowers users to make more informed security decisions.
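To make two of the features named above concrete, the following is a minimal sketch of leetspeak-normalized Shannon entropy and keyboard-walk detection. The substitution table `LEET_MAP`, the row list `KEYBOARD_ROWS`, the `min_len` threshold, and all function names are illustrative assumptions; the abstract does not specify the actual mappings or thresholds used in the study.

```python
import math
from collections import Counter

# Hypothetical leetspeak substitutions (assumption: the study's real table is not given).
LEET_MAP = str.maketrans({
    "@": "a", "4": "a", "0": "o", "1": "l",
    "3": "e", "$": "s", "5": "s", "7": "t", "!": "i",
})

# Hypothetical keyboard rows used to detect horizontal walks (assumption).
KEYBOARD_ROWS = ("1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm")


def shannon_entropy(text: str) -> float:
    """Per-character Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def leet_normalized_entropy(password: str) -> float:
    """Entropy after lowercasing and undoing leetspeak substitutions."""
    normalized = password.lower().translate(LEET_MAP)
    return shannon_entropy(normalized)


def has_keyboard_walk(password: str, min_len: int = 4) -> bool:
    """True if the password contains min_len adjacent keys from one row,
    read left-to-right or right-to-left."""
    lowered = password.lower()
    for row in KEYBOARD_ROWS:
        for seq in (row, row[::-1]):
            for i in range(len(seq) - min_len + 1):
                if seq[i:i + min_len] in lowered:
                    return True
    return False
```

Normalization lowers the entropy only when a substitution collapses distinct symbols onto one letter: 'b@n4na' normalizes to 'banana', whose character distribution is more concentrated than that of the raw string. A password such as 'P@ssw0rd1!' keeps the same per-character entropy after normalization, which suggests why this feature would be combined with pattern and n-gram features rather than used alone.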