Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they capture only pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step toward exploiting language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR transcripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., we leverage naturally transcribed speech to provide noisy but useful semantics over time. Our method forces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and it applies seamlessly to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS.
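To make the pretext task concrete, below is a minimal PyTorch sketch of the transcript-sorting idea described above: transcript segment embeddings are shuffled, cross-attend to video features, and a classifier predicts each segment's original position. The module names, dimensions, and the cross-attention design here are illustrative assumptions, not the authors' actual implementation (see the repository for that).

```python
# Hypothetical sketch of a TVTS-style transcript-sorting objective.
# Assumed components: a cross-attention head and per-segment position classifier.
import torch
import torch.nn as nn


class TranscriptSortingHead(nn.Module):
    """Predicts the original index of each shuffled transcript segment,
    conditioned on learned video representations via cross-attention."""

    def __init__(self, dim: int = 512, num_segments: int = 4, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_segments)  # one logit per candidate position

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (B, num_segments, dim) -- one embedding per shuffled transcript segment
        # video_feats: (B, num_video_tokens, dim) -- spatiotemporal tokens from the video encoder
        attended, _ = self.cross_attn(query=text_feats, key=video_feats, value=video_feats)
        return self.classifier(attended)  # (B, num_segments, num_segments) position logits


def transcript_sorting_loss(head: TranscriptSortingHead,
                            text_feats: torch.Tensor,
                            video_feats: torch.Tensor) -> torch.Tensor:
    """Shuffle transcript segments per sample and train the head to recover the order."""
    b, n, _ = text_feats.shape
    perm = torch.stack([torch.randperm(n) for _ in range(b)]).to(text_feats.device)
    shuffled = torch.gather(text_feats, 1, perm.unsqueeze(-1).expand_as(text_feats))
    logits = head(shuffled, video_feats)  # (B, N, N)
    # Target for shuffled slot i is its original position, i.e., perm[:, i].
    return nn.functional.cross_entropy(logits.flatten(0, 1), perm.flatten())


if __name__ == "__main__":
    head = TranscriptSortingHead()
    loss = transcript_sorting_loss(head, torch.randn(2, 4, 512), torch.randn(2, 32, 512))
    print(loss.item())
```

In this sketch the sorting signal flows through the video tokens, so the video encoder can only lower the loss by learning representations that capture the temporal narrative, which is the intuition behind TVTS.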