Object detection in videos is an important task in computer vision for applications such as object tracking, video summarization, and video search. Although great progress has been made in recent years in improving the accuracy of object detection, thanks to better techniques for training and deploying deep neural networks, these networks remain computationally very intensive. For example, processing a video at 300x300 resolution at 30 fps using the SSD300 (Single Shot Detector) object detection network with a VGG16 backbone requires 1.87 trillion floating-point operations per second. To address this challenge, we make two important observations in the context of videos. First, in many scenarios, most regions of a video frame are background, and the salient objects occupy only a small fraction of the frame area. Second, consecutive frames of a video are strongly temporally correlated. Based on these observations, we propose Pack and Detect (PaD) to reduce the computational requirements of object detection in videos using neural networks. In PaD, the input video frame is processed at full size only in selected frames, called anchor frames. In the frames between anchor frames, namely inter-anchor frames, regions of interest (ROIs) are identified based on the detections in the previous frame. We propose an algorithm that packs the ROIs of each inter-anchor frame together into a smaller frame. To maintain detection accuracy, the algorithm greedily expands the ROIs to provide more background context to the detector. The computational requirements are reduced because the detector processes a smaller input. This method can potentially reduce the FLOPs required per frame by 4x. Tuning the algorithm parameters can provide a 1.3x increase in throughput with only a 2.5% drop in accuracy.
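The packing step for inter-anchor frames can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the helper names (`expand_roi`, `pack_rois`), the margin-based greedy expansion, the shelf-packing placement strategy, and the fallback when ROIs do not fit are all simplifying assumptions for exposition.

```python
import numpy as np

def expand_roi(box, margin, frame_w, frame_h):
    """Greedily grow a box by `margin` pixels on each side, clipped to the
    frame bounds, so the detector sees extra background context."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(frame_w, x2 + margin), min(frame_h, y2 + margin))

def pack_rois(frame, boxes, pack_size=150, margin=8):
    """Crop the expanded ROIs (from the previous frame's detections) and
    shelf-pack them into a pack_size x pack_size canvas. Returns the packed
    canvas and each crop's placement, which is needed to map detections on
    the packed frame back to full-frame coordinates. Returns (None, None)
    if the ROIs do not fit, in which case the caller would fall back to
    processing a full-size anchor frame."""
    frame_h, frame_w = frame.shape[:2]
    canvas = np.zeros((pack_size, pack_size, frame.shape[2]), dtype=frame.dtype)
    placements = []
    cur_x, cur_y, shelf_h = 0, 0, 0
    for box in boxes:
        x1, y1, x2, y2 = expand_roi(box, margin, frame_w, frame_h)
        crop = frame[y1:y2, x1:x2]
        ch, cw = crop.shape[:2]
        if cur_x + cw > pack_size:          # start a new shelf
            cur_x, cur_y = 0, cur_y + shelf_h
            shelf_h = 0
        if cur_y + ch > pack_size or cur_x + cw > pack_size:
            return None, None               # ROIs do not fit in the packed frame
        canvas[cur_y:cur_y + ch, cur_x:cur_x + cw] = crop
        placements.append(((cur_x, cur_y), (x1, y1, x2, y2)))
        cur_x += cw
        shelf_h = max(shelf_h, ch)
    return canvas, placements

# Example: pack two ROIs from a 300x300 frame into a 150x150 input.
frame = np.zeros((300, 300, 3), dtype=np.uint8)
frame[50:100, 50:100] = 255                 # a bright "object"
boxes = [(50, 50, 100, 100), (200, 200, 260, 260)]
canvas, placements = pack_rois(frame, boxes)
```

Running the detector on the 150x150 canvas instead of the 300x300 frame is the source of the savings: FLOPs in a convolutional detector scale roughly with input area, so halving each dimension gives the ~4x reduction cited above.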