Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360-degree videos with ambisonic audio.
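As a concrete illustration of the channel-flip pretext task, the following is a minimal PyTorch sketch under assumed shapes and modules: the names `AVFlipClassifier` and `make_flip_batch`, the placeholder encoders, and the random stand-in tensors are illustrative assumptions, not the architecture or data pipeline used in the paper. It randomly swaps the left/right channels of stereo clips and trains a binary classifier on fused audio-visual features to detect the swap.

```python
# Minimal sketch of the channel-flip pretext task (illustrative assumptions only;
# not the paper's actual encoders, dataset, or training setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFlipClassifier(nn.Module):
    """Toy audio-visual network predicting whether the stereo channels were swapped."""
    def __init__(self, video_dim, audio_dim, feat_dim=128):
        super().__init__()
        # Placeholder encoders; real models would use conv nets over frames / spectrograms.
        self.video_enc = nn.Sequential(nn.Flatten(), nn.Linear(video_dim, feat_dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(audio_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, 2)  # two classes: original vs. flipped

    def forward(self, video, audio):
        fused = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=1)
        return self.head(fused)

def make_flip_batch(audio):
    """Randomly swap left/right channels; the label records whether a swap occurred."""
    # audio: (batch, 2, samples) stereo waveform; dim 1 holds the two channels.
    flipped = torch.rand(audio.shape[0]) < 0.5
    audio = audio.clone()
    audio[flipped] = audio[flipped].flip(dims=[1])  # swap the channel axis
    return audio, flipped.long()

# Illustrative training step with random tensors standing in for real clips.
video = torch.randn(8, 3, 16, 8, 8)      # (batch, channels, frames, height, width)
audio = torch.randn(8, 2, 16000)         # (batch, stereo channels, samples)
audio, labels = make_flip_batch(audio)

model = AVFlipClassifier(video_dim=3 * 16 * 8 * 8, audio_dim=2 * 16000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = F.cross_entropy(model(video, audio), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The binary flip label provides the only supervision; the learned audio and video features can then be reused for the downstream audio-visual tasks evaluated in the paper.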