Reliable identification of encrypted file fragments is a requirement for several security applications, including ransomware detection, digital forensics, and traffic analysis. A popular approach is to estimate the entropy of a fragment and treat a high value as a proxy for randomness, and hence for encryption. However, many modern content types (e.g., office documents and media files) are highly compressed for storage and transmission efficiency. Compression algorithms also output high-entropy data, which reduces the accuracy of entropy-based encryption detectors. Over the years, a variety of approaches have been proposed to distinguish encrypted file fragments from high-entropy compressed fragments. However, these approaches are typically evaluated on only a few select data types and fragment sizes, which makes a fair assessment of their practical applicability impossible. This paper aims to close this gap by comparing existing statistical tests on a large, standardized dataset. Our results show that current approaches cannot reliably tell encryption and compression apart, even for large fragment sizes. To address this issue, we design EnCoD, a learning-based classifier that can reliably distinguish compressed from encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms the current state of the art for most considered fragment sizes and data types.
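To illustrate why entropy alone struggles here, the following is a minimal sketch of a byte-level Shannon entropy estimate applied to 512-byte fragments. It is not EnCoD and not any of the statistical tests compared in the paper. The function name `byte_entropy`, the synthetic text, the use of zlib as the compressor, and the use of `os.urandom` output as a stand-in for ciphertext are all assumptions made for illustration.

```python
import math
import os
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative inputs only (assumptions, not the paper's dataset):
# synthetic text, its zlib-compressed form, and random bytes standing in
# for the output of a cipher.
text = " ".join(f"record {i} value {i * i}" for i in range(5000)).encode()
compressed = zlib.compress(text, 9)
random_like = os.urandom(len(compressed))

FRAGMENT_SIZE = 512  # smallest fragment size mentioned in the abstract

for name, data in [
    ("plain text", text),
    ("zlib-compressed", compressed),
    ("random (ciphertext stand-in)", random_like),
]:
    fragment = data[:FRAGMENT_SIZE]
    print(f"{name:30s} {byte_entropy(fragment):.3f} bits/byte")
```

On inputs like these, one would typically expect the compressed and random fragments to score close to each other and well above the plain text, which is the ambiguity the abstract describes. Note also that a 512-byte fragment rarely reaches the 8 bits/byte ceiling even for truly random data, because each of the 256 byte values is observed only about twice on average.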