Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, evaluation remains mismatched with these downstream applications, as no existing benchmark is tailored to Foley-style scenarios. We find that 74% of videos in past evaluation datasets have poor audio-visual correspondence. Moreover, those datasets are dominated by speech and music, domains that lie outside the Foley use case. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources whose audio is causally tied to on-screen events. The dataset is built with an automated, scalable pipeline applied to in-the-wild internet videos from YouTube- and Vimeo-based sources. Compared to past datasets, we show that FoleyBench offers stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench