We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.
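As a rough illustration of the unified audiovisual transformer block described above (a minimal sketch, not the paper's exact ECLIPSE architecture), one common way to fuse the two streams is bidirectional cross-attention, where video tokens attend to audio tokens and vice versa. The embedding size, head count, pre-norm layout, and GELU feed-forward layers here are assumptions for the sake of a runnable example:

```python
import torch
import torch.nn as nn

class AudioVisualBlock(nn.Module):
    """Sketch of a cross-modal fusion block: each stream queries the other
    via cross-attention, then passes through its own feed-forward layer.
    All dimensions and the layer layout are illustrative assumptions."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v_norm1 = nn.LayerNorm(dim)
        self.a_norm1 = nn.LayerNorm(dim)
        self.v_norm2 = nn.LayerNorm(dim)
        self.a_norm2 = nn.LayerNorm(dim)
        # video queries attend to audio keys/values, and vice versa
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim))
        self.a_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim))

    def forward(self, v, a):
        # v: (batch, num_frames, dim) sparsely sampled video tokens
        # a: (batch, num_audio_tokens, dim) compact audio tokens
        vn, an = self.v_norm1(v), self.a_norm1(a)
        v = v + self.v_from_a(vn, an, an)[0]  # video enriched with audio cues
        a = a + self.a_from_v(an, vn, vn)[0]  # audio enriched with video cues
        v = v + self.v_ffn(self.v_norm2(v))
        a = a + self.a_ffn(self.a_norm2(a))
        return v, a

# usage with dummy tensors: 32 video frames vs. 16 cheap audio tokens
block = AudioVisualBlock()
v, a = block(torch.randn(2, 32, 512), torch.randn(2, 16, 512))
print(v.shape, a.shape)  # torch.Size([2, 32, 512]) torch.Size([2, 16, 512])
```

The efficiency argument follows directly from such a design: because audio tokens are far cheaper to extract and fewer in number than densely sampled frames, shifting part of the temporal coverage to the audio stream reduces both compute and memory.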