The ever-growing realism and quality of generated videos make it increasingly harder for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.