Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors, and we introduce two novel metrics to quantify the prevalence and severity of the problem: IH@vid, the fraction of videos with hallucinations, and IH@dur, the fraction of hallucinated duration. Building on this, we propose Posterior Feature Correction (PFC), a training-free inference-time method that mitigates IH. PFC operates in two passes: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks reveal that state-of-the-art models suffer from severe IH, whereas PFC reduces both the prevalence and duration of hallucinations by over 50\% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
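To make the two metrics concrete, the following is a minimal Python sketch of one plausible reading of their definitions: each detector in the ensemble flags hallucinated (speech/music) time spans, a time step counts as hallucinated only if a strict majority of detectors agree, IH@vid is the fraction of videos with at least one flagged span, and IH@dur is the flagged duration as a fraction of total duration averaged over videos. All function names, the step size, and the averaging convention are illustrative assumptions, not the paper's reference implementation.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a flagged segment


def majority_vote(detector_spans: List[List[Span]], duration: float,
                  step: float = 0.1) -> List[Span]:
    """Keep a time step only if a strict majority of detectors flag it."""
    n = round(duration / step)
    votes = [0] * n
    for spans in detector_spans:
        for start, end in spans:
            for i in range(max(0, round(start / step)),
                           min(n, round(end / step))):
                votes[i] += 1
    needed = len(detector_spans) // 2 + 1
    flagged = [v >= needed for v in votes]
    # Collapse consecutive flagged steps back into (start, end) spans.
    merged, i = [], 0
    while i < n:
        if flagged[i]:
            j = i
            while j < n and flagged[j]:
                j += 1
            merged.append((i * step, j * step))
            i = j
        else:
            i += 1
    return merged


def ih_metrics(per_video: List[Tuple[List[Span], float]]) -> Tuple[float, float]:
    """per_video: list of (majority-voted spans, video duration).
    Returns (IH@vid, IH@dur)."""
    vid_flags, dur_fracs = [], []
    for spans, duration in per_video:
        vid_flags.append(bool(spans))
        dur_fracs.append(sum(e - s for s, e in spans) / duration)
    return sum(vid_flags) / len(vid_flags), sum(dur_fracs) / len(dur_fracs)
```

The two-pass PFC procedure described above can likewise be sketched as a thin wrapper around an existing V2A generator; the `generate`, `detect_hallucinated`, and `mask` interfaces are assumed stand-ins for the model's actual components.

```python
def posterior_feature_correction(video_feats, generate, detect_hallucinated, mask):
    """Training-free, inference-time PFC as described in the abstract.
    Interfaces are hypothetical: generate maps video features to audio,
    detect_hallucinated localizes unsupported speech/music in the audio
    (e.g., via the majority-voting detector ensemble), and mask zeroes
    out video features inside the flagged time spans."""
    audio = generate(video_feats)        # pass 1: initial audio output
    spans = detect_hallucinated(audio)   # hallucinated segments, if any
    if not spans:
        return audio                     # nothing to correct
    masked_feats = mask(video_feats, spans)
    return generate(masked_feats)        # pass 2: regenerate from masked features
```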