Given an untrimmed video and a language query describing a specific temporal moment in the video, video grounding aims to localize the corresponding time interval by jointly understanding the text and the video. One of the most challenging issues is the extremely time- and cost-consuming collection of annotations, which consist of natural-language video captions and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network from video data alone without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing fake (pseudo) text queries to be generated in natural-language form. Specifically, we propose to learn a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature selected from that interval as a language feature, exploiting the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the effectiveness of our language-free training framework, which outperforms the existing zero-shot video grounding method and even several weakly-supervised approaches by large margins on two standard datasets.
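To make the language-free idea concrete, the following is a minimal sketch (not the authors' released code) of how a pseudo training pair could be constructed: sample a temporal interval as the hypothetical correct answer, encode the frames inside it with CLIP's image encoder, and use the pooled visual feature in place of a text-query embedding, relying on CLIP's aligned vision-language space. The CLIP calls follow the public openai/CLIP package; the interval-sampling and pooling choices here are illustrative assumptions, not the paper's exact selection method.

```python
import random
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def make_pseudo_sample(frame_batch: torch.Tensor):
    """frame_batch: (T, 3, 224, 224) CLIP-preprocessed video frames.
    Returns a pseudo temporal label (start, end) and a pseudo query feature."""
    T = frame_batch.shape[0]
    # 1) Sample a temporal interval as the hypothetical correct answer.
    start = random.randint(0, T - 2)
    end = random.randint(start + 1, T - 1)
    with torch.no_grad():
        # 2) Encode the frames inside the interval with CLIP's image encoder.
        feats = clip_model.encode_image(frame_batch[start:end + 1].to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        # 3) Pool them into a single vector and treat it as the language
        #    (query) feature for training the grounding network.
        pseudo_query = feats.mean(dim=0)
    return (start, end), pseudo_query
```

A grounding network trained on such (interval, pseudo-query) pairs never sees natural-language text during training; at test time, real queries are simply encoded with CLIP's text encoder into the same embedding space.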