MarginNCE: Robust Sound Localization with a Negative Margin

The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. Contrastive learning for sound source localization leverages the natural correspondence between audio and visual signals: audio-visual pairs from the same source are treated as positives, while randomly selected pairs are treated as negatives. However, this assumption introduces noisy correspondences; for example, a positive pair may contain audio and visual signals that are unrelated to each other, and a negative pair may contain samples that are semantically similar to the positive one. Our key contribution is to show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization. We propose a simple yet effective approach that slightly modifies the contrastive loss with a negative margin. Extensive experimental results show that our approach performs on par with or better than state-of-the-art methods. Furthermore, we demonstrate that introducing a negative margin into existing methods yields a consistent improvement in performance.
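
To make the idea of a relaxed decision boundary concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss with a margin term. It is not taken from the paper: the function name `margin_nce_loss`, the use of a single global embedding per clip, the placement of the margin (subtracted from the positive-pair similarity only), and the default values of `margin` and `temperature` are all assumptions made for illustration. Under this formulation, a negative margin raises the positive logit, so the loss demands less separation between positives and negatives.

```python
import numpy as np


def margin_nce_loss(audio, visual, margin=-0.2, temperature=0.07):
    """InfoNCE-style loss with a margin on the positive pairs (illustrative sketch).

    audio, visual: arrays of shape (batch, dim); row i of each forms a positive pair.
    margin: subtracted from the positive similarity; a negative value loosens the boundary.
    All names and defaults here are hypothetical, not the paper's settings.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)

    # Pairwise similarities; the diagonal holds the positive pairs.
    sim = a @ v.T

    # Apply the margin to the positive (diagonal) entries only.
    logits = (sim - margin * np.eye(sim.shape[0])) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal as the target class.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 16))
    visual = rng.normal(size=(4, 16))
    for m in (-0.2, 0.0, 0.2):
        print(f"margin={m:+.1f}  loss={margin_nce_loss(audio, visual, margin=m):.4f}")
```

For a fixed batch with at least one negative, this loss increases monotonically with the margin, so a negative margin always gives a smaller loss than the zero-margin baseline on the same embeddings. That is one way to read "less strict": pairs with noisy correspondence are penalized less heavily.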
