State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from a reference portrait video. This capability raises significant privacy concerns regarding the malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and they fail to disrupt the 3D information on which video protection depends. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module that jointly optimizes spatial and frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and ablation studies further validate the effectiveness of our design choices. Our project is available at https://github.com/Richen7418/VDF.
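To make the dual-domain idea concrete, below is a minimal PyTorch sketch of a perturbation module that attends to a frame in both the spatial domain and the frequency (FFT) domain and fuses the two under an L-infinity budget. The module name, branch architectures, fusion rule, and the `eps` budget are illustrative assumptions, not the paper's exact design; see the repository at https://github.com/Richen7418/VDF for the actual implementation.

```python
# Hypothetical sketch of joint spatial-frequency perturbation generation.
# All architectural choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class DualDomainPerturbation(nn.Module):
    """Generate a bounded perturbation by combining spatial attention over
    raw pixels with attention over the frame's 2D FFT magnitude."""

    def __init__(self, channels: int = 3, hidden: int = 16):
        super().__init__()
        # Spatial branch: small conv stack producing per-pixel attention in [0, 1].
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1), nn.Sigmoid())
        # Frequency branch: same structure, applied to the FFT magnitude.
        self.freq = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
        # x: (B, C, H, W) frames in [0, 1].
        a_sp = self.spatial(x)                      # spatial attention map
        spec = torch.fft.fft2(x, norm="ortho")      # complex spectrum
        a_fr = self.freq(spec.abs())                # frequency attention map
        # Re-weight the magnitude, keep the original phase, and invert the FFT
        # so the frequency-domain edit stays aligned with the input frame.
        recon = torch.fft.ifft2(
            a_fr * spec.abs() * torch.exp(1j * spec.angle()),
            norm="ortho").real
        # Fuse both domains and bound the perturbation to an L-inf budget.
        delta = torch.tanh(a_sp * x + recon - x) * eps
        return (x + delta).clamp(0.0, 1.0)


if __name__ == "__main__":
    # Usage sketch: perturb a batch of two 64x64 RGB frames.
    frames = torch.rand(2, 3, 64, 64)
    protected = DualDomainPerturbation()(frames)
    print((protected - frames).abs().max())  # stays within the eps budget
```

In this sketch the `tanh`-scaled residual enforces the perturbation budget in a single pass; a training loop would then optimize the two branches against a 3D-field TFG target, which is omitted here.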