Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk that is driven by dataset biases, such as the prevalence of off-screen sounds, and that current metrics fail to detect. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks reveal for the first time that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
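The abstract names the evaluation pieces (majority voting, IH@vid, IH@dur) and the masking step of PFC but does not give formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each detector emits one boolean flag per fixed-length audio frame, that every test video is known to contain no visible speech or music source (so any detection counts as a hallucination), that IH@dur is pooled over all frames rather than averaged per video, and that "masking" means replacing a feature vector with a constant. All function names are hypothetical.

```python
from typing import List, Sequence

Frames = List[bool]  # assumption: one flag per fixed-length audio frame


def majority_vote(detector_flags: Sequence[Frames]) -> Frames:
    """Flag a frame when strictly more than half of the detectors flag it."""
    n = len(detector_flags)
    return [sum(votes) * 2 > n for votes in zip(*detector_flags)]


def ih_at_vid(per_video: Sequence[Frames]) -> float:
    """Fraction of videos that contain at least one flagged frame."""
    if not per_video:
        return 0.0
    return sum(any(frames) for frames in per_video) / len(per_video)


def ih_at_dur(per_video: Sequence[Frames]) -> float:
    """Flagged frames divided by all frames, pooled over videos (assumption)."""
    total = sum(len(frames) for frames in per_video)
    if total == 0:
        return 0.0
    return sum(sum(frames) for frames in per_video) / total


def mask_video_features(
    features: Sequence[Sequence[float]], flagged: Frames, fill: float = 0.0
) -> List[List[float]]:
    """Second-pass input: overwrite feature vectors at flagged time steps.

    Assumes one feature vector per audio frame and a constant fill value;
    the actual masking strategy is not specified in the abstract.
    """
    return [
        [fill] * len(vec) if bad else list(vec)
        for vec, bad in zip(features, flagged)
    ]


# Toy example: three detectors on video A, and a clean video B.
video_a = majority_vote([
    [False, True, True, False],
    [False, True, False, False],
    [True, True, False, False],
])
video_b = [False, False, False, False]

print(ih_at_vid([video_a, video_b]))
print(ih_at_dur([video_a, video_b]))
print(mask_video_features([[1.0, 2.0], [3.0, 4.0]], [False, True]))
```

In a two-pass loop of the kind the abstract describes, the flags from `majority_vote` on the first-pass audio would select which video features to mask before the generator is run again; the generator call itself is omitted here because its interface is not described.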

