We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame contains a Long Short-Term Memory network augmented with a global memory that provides context information for searching which frames to use over time. Trained with policy gradient methods, AdaFrame generates a prediction, determines which frame to observe next, and computes the utility, i.e., expected future rewards, of seeing more frames at each time step. At testing time, AdaFrame exploits the predicted utilities to achieve adaptive lookahead inference, reducing overall computational cost without a decrease in accuracy. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet. AdaFrame matches the performance of using all frames with only 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We further qualitatively demonstrate that learned frame usage can indicate the difficulty of making classification decisions: easier samples need fewer frames while harder ones require more, both at the instance level within the same class and at the class level among different categories.
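The adaptive lookahead described above can be sketched as a loop that, at each step, updates a recurrent state from the current frame's features, emits a class prediction and a scalar utility (expected future reward), and stops early once the utility falls below a margin. The following is a minimal NumPy sketch, not the paper's implementation: the dimensions, random stand-in weights, and the `margin` threshold are all illustrative assumptions (real weights would come from policy-gradient training, and the paper's model additionally uses a global memory for context).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
FEAT, HID, CLASSES = 32, 16, 5

# Random stand-in weights; in AdaFrame these would be learned
# with policy gradient methods, not sampled.
W_h = rng.normal(0, 0.1, (HID, FEAT + HID))  # recurrent update
W_p = rng.normal(0, 0.1, (CLASSES, HID))     # prediction head
W_u = rng.normal(0, 0.1, (1, HID))           # utility head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaframe_infer(frames, margin=0.05, max_steps=10):
    """Adaptive lookahead inference: stop consuming frames once the
    predicted utility of seeing more frames drops below `margin`."""
    h = np.zeros(HID)
    for t, x in enumerate(frames[:max_steps], start=1):
        h = np.tanh(W_h @ np.concatenate([x, h]))  # update recurrent state
        probs = softmax(W_p @ h)                   # per-step class prediction
        utility = float(W_u @ h)                   # expected future reward
        if utility < margin:                       # early exit saves compute
            break
    return probs, t

frames = [rng.normal(size=FEAT) for _ in range(10)]
probs, frames_used = adaframe_infer(frames)
```

Because the stopping decision depends on the input itself, different videos consume different numbers of frames, which is what produces the per-sample average frame counts (8.21 and 8.65) reported above.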