Text-based video segmentation aims to segment an actor in a video sequence, where the actor and the action it performs are specified by a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner with respect to the actor and its action, due to the problem of \emph{semantic asymmetry}. Semantic asymmetry means that the two modalities carry different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn the actor-/action-related content from the video and the textual query, and then match them in a symmetric manner to localize the target tube. The target tube contains the desired actor and action, and is fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and to keep predictions temporally consistent. The whole model allows for joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
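To make the modular matching idea more concrete, the sketch below shows one possible way to score candidate tubes against the actor and action parts of a query with two separate branches. All module names, feature dimensions, and the additive scoring scheme are assumptions for illustration only, not the authors' implementation.

\begin{verbatim}
# Minimal sketch of symmetric actor/action matching over tube proposals.
# Assumed feature dimensions and module names are for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularMatching(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, embed_dim=256):
        super().__init__()
        # Separate projections so actor and action content are matched independently.
        self.vis_actor = nn.Linear(vis_dim, embed_dim)   # actor-related visual content
        self.vis_action = nn.Linear(vis_dim, embed_dim)  # action-related visual content
        self.txt_actor = nn.Linear(txt_dim, embed_dim)   # actor words of the query
        self.txt_action = nn.Linear(txt_dim, embed_dim)  # action words of the query

    def forward(self, tube_feats, actor_word_feat, action_word_feat):
        # tube_feats: (num_tubes, vis_dim) pooled features of candidate tubes
        # actor_word_feat / action_word_feat: (txt_dim,) pooled query embeddings
        va = F.normalize(self.vis_actor(tube_feats), dim=-1)
        vb = F.normalize(self.vis_action(tube_feats), dim=-1)
        ta = F.normalize(self.txt_actor(actor_word_feat), dim=-1)
        tb = F.normalize(self.txt_action(action_word_feat), dim=-1)
        # Symmetric matching: actor similarity plus action similarity per tube.
        score = va @ ta + vb @ tb                        # (num_tubes,)
        return score.argmax(), score                     # index of the target tube
\end{verbatim}

In the full model, the selected target tube would then be passed to the fully convolutional segmentation head, and the temporal proposal aggregation would link per-frame proposals into tubes before this scoring step.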