Online action detection has attracted increasing research interest in recent years. Current works model historical dependencies and anticipate the future to perceive how an action evolves within a video segment and thereby improve detection accuracy. However, the existing paradigm ignores category-level modeling and pays insufficient attention to efficiency. The representative frames of a category exhibit various characteristics, so category-level modeling can provide guidance that complements temporal dependency modeling. In this paper, we develop an effective exemplar-consultation mechanism that first measures the similarity between a frame and a set of exemplary frames, and then aggregates the exemplary features according to the similarity weights. The mechanism is also efficient, as both similarity measurement and feature aggregation require limited computation. With the exemplar-consultation mechanism, long-term dependencies can be captured by regarding historical frames as exemplars, and category-level modeling can be achieved by regarding representative frames from a category as exemplars. Owing to the complementary information from category-level modeling, our method employs a lightweight architecture yet sets new performance highs on three benchmarks. In addition, when a spatio-temporal network is used to process video frames, our method takes 9.8 seconds to process a one-minute video while achieving comparable performance.
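The two steps of the exemplar-consultation mechanism (similarity measurement, then similarity-weighted aggregation) can be illustrated with a minimal sketch. The abstract does not specify the similarity function or the normalization, so the scaled dot product and the softmax below are assumptions, as are the function name `consult_exemplars` and the toy feature vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def consult_exemplars(frame: Vector, exemplars: Sequence[Vector]) -> List[float]:
    """Hypothetical helper: aggregate exemplar features by similarity to `frame`.

    Assumption: similarity is a scaled dot product normalized with softmax.
    The abstract only says "similarity" and "similarity weights".
    """
    scale = math.sqrt(len(frame))
    # Step 1: measure similarity between the frame and every exemplar.
    weights = softmax([dot(frame, ex) / scale for ex in exemplars])
    # Step 2: aggregate exemplar features using the similarity weights.
    dim = len(exemplars[0])
    return [sum(w * ex[d] for w, ex in zip(weights, exemplars)) for d in range(dim)]


frame = [1.0, 0.0]                      # toy feature of the current frame
history = [[1.0, 0.0], [0.0, 1.0]]      # historical frames used as exemplars
category = [[0.9, 0.1], [0.8, 0.2]]     # representative frames of one category

long_term = consult_exemplars(frame, history)        # long-term dependency feature
category_level = consult_exemplars(frame, category)  # category-level feature
```

The same function serves both purposes described above: only the exemplar set changes. The cost is one dot product and one weighted sum per exemplar, which is consistent with the efficiency claim, though the actual architecture may differ from this sketch.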