The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities, which distort evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when an additional input modality is introduced, using our new modality confusion metric.
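The abstract does not define the modality confusion metric; the sketch below is only one plausible formulation, assuming it quantifies how much accuracy degrades when a second input modality is added on top of a single-modality input. The function name, signature, and numeric values are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: an assumed formulation of "modality confusion" as the
# per-modality accuracy drop observed when the other modality is added as input.
from typing import Dict


def modality_confusion(
    acc_single: Dict[str, float],  # e.g. {"audio": 0.71, "video": 0.65} (hypothetical values)
    acc_joint: float,              # accuracy when both modalities are given as input
) -> Dict[str, float]:
    """For each single modality, return how much accuracy degrades when the
    other modality is added alongside it (positive values indicate confusion)."""
    return {modality: max(0.0, acc - acc_joint) for modality, acc in acc_single.items()}


if __name__ == "__main__":
    # A model that is hurt by adding video to audio would show a positive "audio" entry.
    print(modality_confusion({"audio": 0.71, "video": 0.65}, acc_joint=0.68))
```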