JavaScript's widespread adoption has made it an attractive target for attackers, who employ sophisticated obfuscation techniques to conceal harmful code. Current deobfuscation tools have limitations that severely restrict their practical effectiveness: they struggle with diverse input formats, address only specific obfuscation types, and produce cryptic output that impedes human analysis. To address these challenges, we present JSIMPLIFIER, a comprehensive deobfuscation tool built on a multi-stage pipeline of preprocessing, abstract syntax tree-based static analysis, dynamic execution tracing, and Large Language Model (LLM)-enhanced identifier renaming. We also introduce multi-dimensional evaluation metrics that integrate control/data flow analysis, code simplification assessment, entropy measures, and LLM-based readability assessments. We construct and release the largest real-world obfuscated JavaScript dataset, with 44,421 samples (23,212 malicious samples collected in the wild and 21,209 benign samples). Our evaluation shows that JSIMPLIFIER outperforms existing tools, achieving 100% processing capability across 20 obfuscation techniques, 100% correctness on evaluation subsets, an 88.2% reduction in code complexity, and a more than 4-fold readability improvement validated by multiple LLMs. These results advance the benchmarks for JavaScript deobfuscation research and practical security applications.
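The abstract lists entropy measures among the evaluation metrics but does not say how they are computed. As a rough illustration only, the sketch below assumes a character-level Shannon entropy over the source text; the function name `shannon_entropy` and the two sample snippets are hypothetical and are not taken from JSIMPLIFIER.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the character distribution of `text`,
    in bits per character. An empty string is defined here as 0.0."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical snippets, for illustration only (not from the paper's dataset).
obfuscated = "var _0x3f2a=['\\x68\\x65\\x6c\\x6c\\x6f'];console['log'](_0x3f2a[0x0]);"
readable = "var greeting = 'hello'; console.log(greeting);"

if __name__ == "__main__":
    print(f"obfuscated: {shannon_entropy(obfuscated):.3f} bits/char")
    print(f"readable:   {shannon_entropy(readable):.3f} bits/char")
```

A metric of this kind would typically be compared before and after deobfuscation. Whether the tool measures entropy over characters, tokens, or identifiers is not stated in the abstract, so the character-level choice here is an assumption.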