Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk, driven by dataset biases such as the prevalence of off-screen sounds, that current metrics do not detect at all. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in two passes: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, while preserving, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
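As a rough illustration of the evaluation framework, the sketch below combines per-frame flags from several detectors by strict majority vote and then computes the two metrics. It makes several assumptions that the abstract does not confirm: each detector emits one boolean per fixed-length audio frame (True meaning "speech or music with no visual source"), IH@vid counts a video as hallucinated if any voted frame is flagged, and IH@dur is the flagged share of all frames pooled across videos. The function names `majority_vote` and `ih_metrics` are hypothetical.

```python
from typing import List, Sequence, Tuple


def majority_vote(detector_flags: Sequence[Sequence[bool]]) -> List[bool]:
    """Frame-wise strict majority over detectors (assumed vote rule)."""
    n_detectors = len(detector_flags)
    if n_detectors == 0:
        raise ValueError("need at least one detector")
    n_frames = len(detector_flags[0])
    if any(len(flags) != n_frames for flags in detector_flags):
        raise ValueError("all detectors must cover the same number of frames")
    # A frame is flagged only if more than half of the detectors flag it.
    return [
        2 * sum(bool(flags[t]) for flags in detector_flags) > n_detectors
        for t in range(n_frames)
    ]


def ih_metrics(videos: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Return (IH@vid, IH@dur) from per-video voted frame flags.

    Assumed definitions: IH@vid is the share of videos with at least one
    flagged frame; IH@dur is flagged frames over all frames in the set.
    """
    n_videos = len(videos)
    total_frames = sum(len(v) for v in videos)
    if n_videos == 0 or total_frames == 0:
        raise ValueError("need at least one non-empty video")
    ih_vid = sum(1 for v in videos if any(v)) / n_videos
    ih_dur = sum(sum(1 for f in v if f) for v in videos) / total_frames
    return ih_vid, ih_dur


if __name__ == "__main__":
    # Toy data: three detectors, four frames, two videos.
    video_a = majority_vote([
        [True, True, False, False],
        [True, False, False, False],
        [False, True, False, True],
    ])
    video_b = majority_vote([[False] * 4, [False] * 4, [True, False, False, False]])
    print(ih_metrics([video_a, video_b]))
```

A strict majority is used so that a single over-sensitive detector cannot flag a frame on its own; a real pipeline would likely also need a minimum segment duration, which is left out here.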

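The two-pass structure of Posterior Feature Correction can be sketched as follows. This is only a minimal outline under stated assumptions, not the authors' implementation: video features are a list of per-frame vectors at a known feature rate `fps`, "masking" means replacing the affected frames with zeros (the abstract does not say what the mask value is), and `generate` and `detect` are hypothetical callables standing in for the V2A model and the hallucination detector.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Features = List[List[float]]   # one feature vector per video frame
Segment = Tuple[float, float]  # (start_s, end_s) in seconds
Audio = TypeVar("Audio")


def mask_features(features: Features, segments: Sequence[Segment], fps: float) -> Features:
    """Return a copy of `features` with frames overlapping any segment zeroed."""
    masked = [list(frame) for frame in features]
    for start_s, end_s in segments:
        first = max(0, math.floor(start_s * fps))
        last = min(len(masked), math.ceil(end_s * fps))
        for t in range(first, last):
            masked[t] = [0.0] * len(masked[t])  # assumed mask value: zeros
    return masked


def posterior_feature_correction(
    features: Features,
    generate: Callable[[Features], Audio],
    detect: Callable[[Audio], Sequence[Segment]],
    fps: float,
) -> Audio:
    """Two-pass inference: draft, detect hallucinated spans, mask, regenerate."""
    draft = generate(features)           # pass 1: initial audio
    segments = detect(draft)             # hallucinated time spans, in seconds
    if not segments:
        return draft                     # nothing to correct
    return generate(mask_features(features, segments, fps))  # pass 2


if __name__ == "__main__":
    # Toy stand-ins: "audio" is the per-frame feature sum, and any frame
    # whose value exceeds 5 is treated as a hallucinated one-second span.
    feats = [[1.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
    toy_generate = lambda f: [sum(frame) for frame in f]
    toy_detect = lambda audio: [(float(i), float(i + 1)) for i, a in enumerate(audio) if a > 5]
    print(posterior_feature_correction(feats, toy_generate, toy_detect, fps=1.0))
```

Because the correction only touches the conditioning features at inference time, no model weights change, which is consistent with the method being described as training-free.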
