Patchwork: A Patch-wise Attention Network for Efficient Object Detection and Segmentation in Video Streams

Recent advances in single-frame object detection and segmentation techniques have motivated a wide range of works to extend these methods to process video streams. In this paper, we explore the idea of hard attention for latency-sensitive applications. Instead of reasoning about every frame separately, our method selects and processes only a small sub-window of each frame. Our technique then makes predictions for the full frame based on the sub-windows from previous frames and the update from the current sub-window. The latency reduction from this hard attention mechanism comes at the cost of degraded accuracy. We make two contributions to address this. First, we propose a specialized memory cell that recovers lost context when processing sub-windows. Second, we adopt a Q-learning-based policy training strategy that enables our approach to intelligently select the sub-windows such that the staleness in the memory hurts performance the least. Our experiments suggest that our approach reduces latency by approximately a factor of four without significantly sacrificing accuracy on the ImageNet VID video object detection dataset and the DAVIS video object segmentation dataset. We further demonstrate that the saved computation can be reinvested into other parts of the network, resulting in an accuracy increase at a computational cost comparable to that of the original system and outperforming other recently proposed state-of-the-art methods in the low-latency range.
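The loop described in the abstract (select one sub-window, refresh a memory with it, predict from the memory, and train the selection policy with Q-learning) can be illustrated with a toy sketch. This is not the paper's architecture. The grid layout, the function names (`crop`, `write_back`, `select_window`, `q_update`, `run_stream`), the tabular Q-function, the choice of state (the previously selected window), and the reward (negative absolute error between the memory and the true frame) are all assumptions made for illustration. Raw pixel copying stands in for the learned feature extractor and memory cell.

```python
import random

def crop(frame, window, grid):
    """Return the sub-window of a 2D frame (list of rows) for grid cell `window`."""
    h, w = len(frame), len(frame[0])
    ph, pw = h // grid, w // grid
    r, c = divmod(window, grid)
    return [row[c * pw:(c + 1) * pw] for row in frame[r * ph:(r + 1) * ph]]

def write_back(memory, patch, window, grid):
    """Overwrite the memory region covered by `window` with the fresh patch."""
    ph, pw = len(patch), len(patch[0])
    r, c = divmod(window, grid)
    for i, row in enumerate(patch):
        memory[r * ph + i][c * pw:(c + 1) * pw] = row

def select_window(q_table, state, epsilon, rng):
    """Epsilon-greedy choice over the grid cells (hypothetical policy form)."""
    n = len(q_table[state])
    if rng.random() < epsilon:
        return rng.randrange(n)
    return max(range(n), key=lambda a: q_table[state][a])

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])

def run_stream(frames, grid=2, epsilon=0.1, seed=0):
    """Process a list of equally sized frames, touching one sub-window per frame."""
    rng = random.Random(seed)
    n = grid * grid
    # Assumption: the state is simply the index of the previously chosen window.
    q_table = [[0.0] * n for _ in range(n)]
    # Assumption: the first frame is processed in full to initialise the memory.
    memory = [row[:] for row in frames[0]]
    state = 0
    for frame in frames[1:]:
        action = select_window(q_table, state, epsilon, rng)
        write_back(memory, crop(frame, action, grid), action, grid)
        # Training-time reward: penalise stale memory against the full frame.
        error = sum(abs(m - f)
                    for m_row, f_row in zip(memory, frame)
                    for m, f in zip(m_row, f_row))
        q_update(q_table, state, action, -error, action)
        state = action
    return memory, q_table
```

With `grid=2`, each step touches one quarter of the frame, which mirrors the roughly fourfold reduction in processed pixels that the abstract reports as a latency gain. The reward here needs the full frame, so in this sketch it is available only during training; at inference time only `select_window` and `write_back` would run.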
