Fine-grained action detection is an important task with numerous applications in robotics, human-computer interaction, and video surveillance. Several existing methods use the popular two-stream approach, which learns spatial and temporal information independently of one another. Additionally, the temporal stream of such models usually relies on optical flow extracted from the video stream. In this work, we propose a deep learning model that jointly learns spatial and temporal information without requiring optical flow. We also propose a novel convolution, the locally-consistent deformable convolution, which enforces a local coherency constraint on the receptive fields. The model produces short-term spatio-temporal features, which can be flexibly used in conjunction with other long-temporal modeling networks. The proposed features, used in conjunction with the state-of-the-art long-temporal model ED-TCN, outperform the original ED-TCN implementation on two fine-grained action datasets, 50 Salads and GTEA, by up to 10.0% and 4.3% respectively, and also outperform the recent state-of-the-art TDRN by up to 5.9% and 2.6%.
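To make the core idea concrete, here is a minimal, illustrative 1-D sketch of a deformable convolution with a local-consistency regularizer on the sampling offsets. All function names and the exact penalty form are hypothetical simplifications for illustration; the paper's actual operator is a learned 2-D/temporal convolution, and its coherency constraint may take a different form.

```python
import numpy as np

def linear_sample(x, pos):
    """Sample a 1-D signal x at fractional positions via linear interpolation."""
    pos = np.clip(pos, 0, len(x) - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(x) - 1)
    frac = pos - lo
    return (1 - frac) * x[lo] + frac * x[hi]

def deformable_conv1d(x, weight, offsets):
    """Toy 1-D deformable convolution: tap k at output position t samples
    the fractional location t + k - K//2 + offsets[t, k] instead of the
    fixed grid point, so the receptive field can shift per position."""
    K, T = len(weight), len(x)
    out = np.zeros(T)
    for t in range(T):
        pos = t + np.arange(K) - K // 2 + offsets[t]
        out[t] = np.dot(weight, linear_sample(x, pos))
    return out

def local_consistency_penalty(offsets):
    """Illustrative coherency term: penalize squared differences between
    the offsets of neighboring output positions, encouraging nearby
    receptive fields to deform consistently (hypothetical form)."""
    return np.sum((offsets[1:] - offsets[:-1]) ** 2)
```

With all offsets zero and an identity kernel, the operator reduces to a standard convolution, and the penalty vanishes for spatially constant offset fields, which is the intuition behind the local coherency constraint.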