The rapid evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted via the Fast Fourier Transform with semantic features obtained from the CLIP vision encoder. These features are fused into a joint representation and trained progressively in two stages. Evaluations on the GenImage, WildFake, DiTFake, GPT-ImgEval, and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model achieves state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance on most generators. Unlike existing methods, which often degrade on the high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefit of integrating spectral and semantic features for generalized detection of AI-generated images.
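The fusion idea can be illustrated with a minimal sketch. This is not the paper's implementation: the radial-profile spectral feature, the 512-dimensional placeholder for the CLIP embedding, and simple concatenation as the fusion step are all assumptions made for illustration.

```python
import numpy as np

def spectral_features(img, n_bins=32):
    """Radially averaged log-magnitude FFT spectrum of a grayscale image.

    A common, simple way to summarize frequency content; the paper's
    actual spectral feature extraction may differ.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.log1p(np.abs(f))
    h, w = mag.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)  # radius of each frequency bin
    bins = np.linspace(0.0, r.max() + 1e-6, n_bins + 1)
    idx = np.clip(np.digitize(r.ravel(), bins) - 1, 0, n_bins - 1)
    totals = np.bincount(idx, weights=mag.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return totals / np.maximum(counts, 1)  # mean magnitude per radial bin

def fuse(spec_feat, sem_feat):
    """Fuse spectral and semantic features into one joint representation
    (here: plain concatenation, an assumed fusion scheme)."""
    return np.concatenate([spec_feat, sem_feat])

rng = np.random.default_rng(0)
img = rng.random((64, 64))   # stand-in for a grayscale input image
sem = rng.random(512)        # stand-in for a CLIP image embedding (512-d assumed)
joint = fuse(spectral_features(img), sem)
print(joint.shape)  # (544,)
```

In practice, the semantic vector would come from a frozen CLIP vision encoder and the joint representation would feed a trainable classification head, trained in the two progressive stages described above.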