Video-text retrieval (VTR) is an attractive yet challenging task in multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically align video and text with completely heterogeneous visual-textual information, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic alignment model for VTR named HiSE, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
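The textual decomposition into occurrence, action, and entity can be made concrete with a short sketch. The abstract does not specify which parser is used, so the following assumes a standard dependency parser (spaCy here), treating verbs as actions, noun-chunk heads as entities, and the full caption as the occurrence; it is an illustration under these assumptions, not HiSE's actual implementation.

```python
# A minimal sketch of the textual semantic parsing described above.
# Assumption: spaCy's dependency parse stands in for the paper's parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def parse_semantics(caption: str) -> dict:
    """Split a caption into holistic (occurrence) and discrete
    (action, entity) high-level semantics."""
    doc = nlp(caption)
    actions = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]   # discrete: actions
    entities = [chunk.root.lemma_ for chunk in doc.noun_chunks]   # discrete: entities
    occurrence = caption                                          # holistic: the whole event
    return {"occurrence": occurrence, "action": actions, "entity": entities}

print(parse_semantics("a man is playing guitar on the street"))
# {'occurrence': 'a man is playing guitar on the street',
#  'action': ['play'], 'entity': ['man', 'guitar', 'street']}
```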
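Likewise, one plausible reading of the graph reasoning step is a round of message passing between a holistic-semantics node and its discrete-semantics nodes. The star topology, GCN-style update, and module names below are assumptions made for illustration; the paper's actual graph reasoning techniques may differ.

```python
# An illustrative (hypothetical) graph-reasoning step: one round of message
# passing over a star graph whose center is the holistic node and whose
# leaves are the discrete nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, holistic: torch.Tensor, discrete: torch.Tensor):
        # holistic: (B, D) embedding of holistic semantics (e.g., the occurrence)
        # discrete: (B, N, D) embeddings of N discrete semantics (actions/entities)
        nodes = torch.cat([holistic.unsqueeze(1), discrete], dim=1)  # (B, N+1, D)
        m = nodes.size(1)
        # Star adjacency: node 0 (holistic) connects to every discrete node.
        adj = torch.zeros(m, m, device=nodes.device)
        adj[0, 1:] = adj[1:, 0] = 1.0
        adj += torch.eye(m, device=nodes.device)     # self-loops
        adj = adj / adj.sum(-1, keepdim=True)        # row-normalize
        out = F.relu(self.proj(adj @ nodes))         # aggregate + transform
        return out[:, 0], out[:, 1:]                 # updated holistic, discrete

layer = SemanticGraphLayer(dim=256)
h, d = layer(torch.randn(2, 256), torch.randn(2, 5, 256))
```

Here a single linear projection with row-normalized adjacency keeps the sketch minimal; a full model would likely stack several such layers and use separate graphs for the visual and textual branches.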