Obtaining accurate information about human behaviour is one of the most important tasks in machine intelligence. Human Activity Recognition, which aims to understand human activities from video, is challenging because of problems including background, camera motion and dataset variations. This paper proposes two CNN-based architectures with three streams, which allow the model to exploit the dataset under different settings. The three pathways differ in their frame rates: the single pathway operates at a single frame rate and captures spatial information, the slow pathway operates at low frame rates and also captures spatial information, and the fast pathway operates at high frame rates and captures fine temporal information. After the CNN encoders, we add a bidirectional LSTM and attention heads, respectively, to capture context and temporal features. Experiments with various algorithms on the UCF-101, Kinetics-600 and AVA datasets show that the proposed models achieve state-of-the-art performance on the human action recognition task.
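The idea of pathways that differ only in frame rate can be illustrated with a minimal sketch of temporal sampling. The abstract gives no concrete frame rates, so the stride values below, the choice to model each pathway as a fixed stride over the clip, and the function names `sample_pathway` and `three_pathway_indices` are all assumptions made for illustration, not the authors' configuration.

```python
def sample_pathway(num_frames, stride):
    """Return the indices kept when reading every `stride`-th frame of a clip."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def three_pathway_indices(num_frames, single_stride=8, slow_stride=16, fast_stride=2):
    """Frame indices for the three pathways.

    All stride defaults are hypothetical: a large stride gives a low frame
    rate (slow pathway), a small stride gives a high frame rate (fast pathway).
    """
    return {
        "single": sample_pathway(num_frames, single_stride),
        "slow": sample_pathway(num_frames, slow_stride),
        "fast": sample_pathway(num_frames, fast_stride),
    }


if __name__ == "__main__":
    indices = three_pathway_indices(num_frames=64)
    for name, frames in indices.items():
        print(f"{name:>6}: {len(frames)} frames, first indices {frames[:4]}")
```

Under these assumed strides, the fast pathway sees many more frames of the same clip than the slow pathway, which is what lets it pick up fine temporal detail. Each list of indices would then select the frames passed to that pathway's CNN encoder.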