Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Scaling up weakly supervised datasets has proven highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular approach to video-text data mining, which relies on automatic speech recognition (ASR) and is used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches either do not provide proper language descriptions (video tags) or are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone, and we fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to existing pre-training and data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we plan to publicly release the generated captions.
