The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework that generalizes strongly by leveraging the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned on an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce embeddings in which real and fake samples are more clearly separated. Additionally, we explore attribution-based supervision schemes, in which deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.
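For readers unfamiliar with the objective named above, the sketch below illustrates the standard triplet loss on which such variants are built: embeddings of same-class samples (anchor and positive, e.g. two real faces) are pulled together, while embeddings of the opposite class (negative, e.g. a fake face) are pushed at least a margin away. This is a minimal illustration under assumed settings, not the paper's exact formulation; the embedding dimension, margin value, and triplet sampling shown here are placeholders.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet loss: encourages d(anchor, positive) + margin
    <= d(anchor, negative) in the embedding space."""
    d_pos = F.pairwise_distance(anchor, positive)  # ||a - p||_2
    d_neg = F.pairwise_distance(anchor, negative)  # ||a - n||_2
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical 512-d embeddings from a fine-tuned encoder (random here).
a = torch.randn(8, 512)  # anchors   (e.g. real faces)
p = torch.randn(8, 512)  # positives (same class as the anchor)
n = torch.randn(8, 512)  # negatives (opposite class, e.g. fakes)
print(triplet_loss(a, p, n).item())
```

PyTorch also ships this objective as `torch.nn.TripletMarginLoss`, which could be dropped in place of the hand-rolled function above.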