Local-Global Context Aware Transformer for Language-Guided Video Segmentation

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations; they struggle to capture long-term context and are prone to visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so that the entire video can be queried with the language expression efficiently. The memory has two components: one persistently preserves global video content, and the other dynamically gathers local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. This vector is then used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, our Locater-based solution achieved 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge. Our code and dataset are available at: https://github.com/leonnnop/Locater
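The memory mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it is a simplified illustration under stated assumptions. The function names (`segment_video`, `attend`), the mean-pooled memory entries, the additive query update, and the thresholded dot-product mask are all hypothetical choices made for brevity. What the sketch does show is the idea from the abstract: each frame attends only to a fixed-size memory plus its own pixels, so the number of attention scores grows linearly with video length.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def attend(query, keys):
    """Dot-product attention of one query over `keys`; returns the weighted mean of keys."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * k[d] for e, k in zip(exps, keys)) for d in range(len(query))]


def segment_video(frames, expression, global_slots=2, local_window=2):
    """frames: list of frames, each a list of per-pixel feature vectors.
    expression: one sentence embedding with the same dimension as the pixel features.
    Returns (masks, n_scores), where n_scores counts the attention scores computed."""
    # Global memory (assumption): mean features of a few evenly spaced frames, written once.
    step = max(1, len(frames) // global_slots)
    global_memory = [mean(f) for f in frames[::step][:global_slots]]
    # Local memory: fixed-length queue, so the oldest entry is dropped automatically.
    local_memory = deque(maxlen=local_window)

    masks, n_scores = [], 0
    for pixels in frames:
        # The context size is bounded by global_slots + local_window + pixels per frame.
        context = global_memory + list(local_memory) + pixels
        n_scores += len(context)
        # Adaptive query (assumption): the expression plus what it attends to in the context.
        attended = attend(expression, context)
        query = [e + a for e, a in zip(expression, attended)]
        # Query the frame: a pixel is marked foreground when it aligns with the query.
        mask = [1 if dot(query, p) > 0 else 0 for p in pixels]
        masks.append(mask)
        # Segmentation history: remember the mean feature of the predicted foreground.
        foreground = [p for p, m in zip(pixels, mask) if m] or pixels
        local_memory.append(mean(foreground))
    return masks, n_scores


# Toy usage: four identical frames, each with three "pixels" of 2-D features.
frame = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
masks, n_scores = segment_video([frame] * 4, expression=[1.0, 0.0])
```

In this sketch the per-frame cost is capped at `global_slots + local_window + len(pixels)` scores regardless of how many frames came before, which is the linear-time, constant-memory property the abstract contrasts with full self-attention over all frames.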

