Rate-Accuracy Trade-Off In Video Classification With Deep Convolutional Neural Networks

Advanced video classification systems decode video frames to derive the texture and motion representations needed for ingestion and analysis by spatio-temporal deep convolutional neural networks (CNNs). However, in visual Internet-of-Things applications, surveillance systems and semantic crawlers of large video repositories, the video capture and the CNN-based semantic analysis tend not to be co-located. This necessitates the transport of compressed video over networks and incurs significant overhead in bandwidth and energy consumption, thereby undermining the deployment potential of such systems. In this paper, we investigate the trade-off between the encoding bitrate and the achievable accuracy of CNN-based video classification models that directly ingest AVC/H.264 and HEVC encoded videos. Instead of retaining entire compressed video bitstreams and applying complex optical flow calculations prior to CNN processing, we retain only motion vector and selected texture information at significantly reduced bitrates and apply no additional processing prior to CNN ingestion. Based on three CNN architectures and two action recognition datasets, we achieve 11%-94% savings in bitrate with marginal effect on classification accuracy. A model-based selection between multiple CNNs increases these savings further, to the point where, if up to 7% loss of accuracy can be tolerated, video classification can take place with as little as 3 kbps for the transport of the required compressed video information to the system implementing the CNN models.
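
The selection idea in the last sentence can be illustrated with a minimal sketch. It is not the paper's selection model: it assumes that each candidate CNN has already been characterised by one (bitrate, accuracy) operating point, and it simply picks the lowest-bitrate point whose accuracy loss, relative to the best candidate, stays within a tolerance. The names `OperatingPoint`, `select_operating_point`, `bitrate_saving` and all numbers in `POINTS` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """One candidate configuration (hypothetical structure, not from the paper)."""
    model: str
    bitrate_kbps: float   # bitrate of the compressed information sent to the CNN
    accuracy: float       # classification accuracy as a fraction in [0, 1]


def select_operating_point(
    points: Sequence[OperatingPoint], max_accuracy_loss: float
) -> Optional[OperatingPoint]:
    """Return the lowest-bitrate point within the tolerated accuracy loss.

    The loss is measured against the most accurate candidate in `points`.
    Returns None when no candidates are given.
    """
    if not points:
        return None
    best_accuracy = max(p.accuracy for p in points)
    admissible = [
        p for p in points if best_accuracy - p.accuracy <= max_accuracy_loss
    ]
    # The most accurate point always has zero loss, so `admissible` is non-empty.
    return min(admissible, key=lambda p: p.bitrate_kbps)


def bitrate_saving(reference_kbps: float, reduced_kbps: float) -> float:
    """Fractional bitrate saving of `reduced_kbps` relative to `reference_kbps`."""
    return 1.0 - reduced_kbps / reference_kbps


# Made-up operating points, used only to exercise the functions above.
POINTS = [
    OperatingPoint("cnn_a", bitrate_kbps=200.0, accuracy=0.90),
    OperatingPoint("cnn_b", bitrate_kbps=40.0, accuracy=0.88),
    OperatingPoint("cnn_c", bitrate_kbps=5.0, accuracy=0.84),
]

if __name__ == "__main__":
    for tolerance in (0.0, 0.03, 0.07):
        chosen = select_operating_point(POINTS, tolerance)
        print(tolerance, chosen.model, chosen.bitrate_kbps)
```

Under these assumed numbers, loosening the tolerance moves the choice toward cheaper operating points, which is the qualitative behaviour the abstract describes; the actual rate-accuracy curves would have to come from measurements on the datasets.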
