This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource-efficient video recognition that is suitable for both online and offline scenarios. Using decent yet computationally cheap features extracted at a coarse scale by a lightweight CNN, LiteEval decides on the fly, for each incoming video frame, whether to compute more powerful features at a finer scale to capture more detail. This is achieved by a coarse LSTM and a fine LSTM that operate cooperatively, together with a conditional gating module that learns when to allocate more computation. Extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, demonstrate that LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions.
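The control flow described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the exponential-moving-average update stands in for an LSTM step, the hand-written novelty threshold stands in for the learned conditional gate, and every name here (`run_coarse_to_fine`, `novelty_gate`, `coarse_fn`, `fine_fn`, `Cost`) is hypothetical. It only shows the idea that cheap coarse features are computed for every frame, while expensive fine features are computed only when the gate fires.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class Cost:
    """Counts how often each feature extractor was invoked."""
    coarse_calls: int = 0
    fine_calls: int = 0


def recurrent_update(state: Vector, x: Vector, alpha: float = 0.5) -> Vector:
    # Toy stand-in for an LSTM step (assumption): exponential moving average.
    return [(1 - alpha) * s + alpha * v for s, v in zip(state, x)]


def novelty_gate(z: Vector, coarse_state: Vector, threshold: float = 1.5) -> bool:
    # Hand-written stand-in for the learned gate (assumption): fire when the
    # coarse feature deviates strongly from the running coarse state.
    return max(abs(a - b) for a, b in zip(z, coarse_state)) > threshold


def run_coarse_to_fine(
    frames: Sequence[Vector],
    coarse_fn: Callable[[Vector], Vector],
    fine_fn: Callable[[Vector], Vector],
    coarse_dim: int,
    fine_dim: int,
) -> Tuple[Vector, List[bool], Cost]:
    coarse_state = [0.0] * coarse_dim
    fine_state = [0.0] * fine_dim
    cost = Cost()
    decisions: List[bool] = []
    for frame in frames:
        # Cheap coarse features are always computed.
        z = coarse_fn(frame)
        cost.coarse_calls += 1
        # The gate sees the new coarse feature and the previous coarse state.
        use_fine = novelty_gate(z, coarse_state)
        decisions.append(use_fine)
        coarse_state = recurrent_update(coarse_state, z)
        if use_fine:
            # Expensive fine features are computed only on demand.
            fine_state = recurrent_update(fine_state, fine_fn(frame))
            cost.fine_calls += 1
        # Otherwise the fine state is carried over unchanged (assumption).
    return fine_state, decisions, cost


# Toy "video": each frame is a 2-element vector; the scene changes at frame 2.
frames = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
fine_state, decisions, cost = run_coarse_to_fine(
    frames,
    coarse_fn=lambda f: [sum(f) / len(f)],  # cheap summary: mean value
    fine_fn=lambda f: list(f),              # "expensive" full-resolution feature
    coarse_dim=1,
    fine_dim=2,
)
print(decisions, cost)
```

In this toy run the gate fires only on the frame where the scene changes, so the fine extractor is called once for five frames. In a trained system the gate would be learned jointly with the recurrent models rather than thresholded by hand.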