In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach that discovers clues beyond discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. Our goal is to verify that the person seen in a video is who they purport to be, by detecting anomalous facial movements corresponding to the spoken words. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs), rather than deep CNN features, to capture a person's facial and head movement, and we are the first to use word-conditioned facial motion analysis. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without any video manipulation, which were not addressed in prior work.
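To make the idea of word-conditioned facial motion concrete, the following is a minimal sketch, not the paper's actual pipeline: it assumes per-frame AU intensities are already available (e.g., from a tool such as OpenFace) along with word-level timestamps from a forced aligner, and it scores a test video against a speaker's reference profile of per-word AU patterns. All function and variable names here are illustrative assumptions.

```python
# Hypothetical sketch of word-conditioned AU features for speaker attribution.
# Assumes au_frames is a (T, K) array of K Action Unit intensities over T
# video frames, and word_spans lists (word, start_sec, end_sec) alignments.
import numpy as np

def word_conditioned_au_features(au_frames, fps, word_spans):
    """Aggregate AU intensities over the frames spanned by each spoken word.

    Returns {word: list of mean-AU vectors}, one vector per utterance.
    """
    feats = {}
    for word, start, end in word_spans:
        lo = int(start * fps)
        hi = max(int(end * fps), lo + 1)
        segment = au_frames[lo:hi]            # frames while the word is spoken
        if len(segment) == 0:
            continue
        feats.setdefault(word, []).append(segment.mean(axis=0))
    return feats

def anomaly_score(test_feats, reference_feats):
    """Compare a test video's per-word AU patterns against a speaker's
    reference profile; a higher score indicates more anomalous motion."""
    scores = []
    for word, vecs in test_feats.items():
        if word not in reference_feats:
            continue
        ref = np.mean(reference_feats[word], axis=0)
        for v in vecs:
            cos = v @ ref / (np.linalg.norm(v) * np.linalg.norm(ref) + 1e-8)
            scores.append(1.0 - cos)          # cosine distance per utterance
    return float(np.mean(scores)) if scores else 0.0
```

Because the features are conditioned on the spoken word rather than on raw visual quality, a score of this kind can flag lookalike or dubbed videos in which no pixels were synthetically manipulated.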