This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries. The task is challenging because only video-level labels indicating which events occur are provided for training. Moreover, a labeled event may not appear in one of the modalities, which results in a modality-specific noisy label problem. In this work, we propose a training strategy that dynamically identifies and removes modality-specific noisy labels. It is motivated by two key observations: 1) networks tend to learn clean samples first; and 2) a labeled event appears in at least one modality. Specifically, we sort the losses of all instances within a mini-batch separately for each modality, and then select noisy samples according to the relationships between intra-modal and inter-modal losses. In addition, we propose a simple yet effective noise ratio estimation method that computes the proportion of instances whose confidence is below a preset threshold. Our method yields large improvements over the previous state of the art (\eg, from 60.0\% to 63.8\% on the segment-level visual metric), demonstrating its effectiveness. Code and trained models are publicly available at \url{https://github.com/MCG-NJU/JoMoLD}.
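As an illustration only, the two ingredients described above (threshold-based noise ratio estimation and loss-based selection of noisy labels) can be sketched in plain Python for a single event class over one mini-batch. The abstract does not specify the exact selection rule, so this sketch assumes one plausible reading: a label is flagged as noisy in a modality only if its loss ranks among the largest in that modality and also exceeds the loss of the same instance in the other modality. The function names, the default threshold of 0.5, and this rule are assumptions made for the sketch and are not taken from the released code.

```python
from typing import List, Sequence, Set, Tuple


def estimate_noise_ratio(confidences: Sequence[float], threshold: float = 0.5) -> float:
    """Proportion of labeled instances whose confidence is below `threshold`.

    The default threshold is an assumption; the abstract only says "preset".
    """
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


def top_loss_indices(losses: Sequence[float], ratio: float) -> Set[int]:
    """Indices of the round(ratio * n) instances with the largest loss."""
    k = int(round(ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return set(order[:k])


def select_noisy(
    loss_audio: Sequence[float],
    loss_visual: Sequence[float],
    ratio_audio: float,
    ratio_visual: float,
) -> Tuple[List[bool], List[bool]]:
    """Flag modality-specific noisy labels for one event class.

    Assumed rule (not confirmed by the source): an instance is noisy in a
    modality if its loss is among the largest in that modality (intra-modal
    ranking) and is larger than its loss in the other modality (inter-modal
    comparison). The strict inequality means a label is never removed from
    both modalities, matching the observation that a labeled event appears
    in at least one modality.
    """
    if len(loss_audio) != len(loss_visual):
        raise ValueError("both modalities must cover the same instances")
    cand_a = top_loss_indices(loss_audio, ratio_audio)
    cand_v = top_loss_indices(loss_visual, ratio_visual)
    n = len(loss_audio)
    noisy_a = [i in cand_a and loss_audio[i] > loss_visual[i] for i in range(n)]
    noisy_v = [i in cand_v and loss_visual[i] > loss_audio[i] for i in range(n)]
    return noisy_a, noisy_v


# Toy mini-batch of four instances that all carry the same video-level label.
audio_losses = [0.1, 2.0, 0.2, 0.3]
visual_losses = [0.2, 0.1, 1.5, 3.0]
noisy_audio, noisy_visual = select_noisy(audio_losses, visual_losses, 0.25, 0.5)
```

In a training loop, the flagged entries would be masked out of the corresponding modality's label before computing the loss, and the ratios would come from `estimate_noise_ratio` applied to each modality's predictions.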