DOAD: Decoupled One Stage Action Detection Network

Localizing people and recognizing their actions in videos is a challenging step toward high-level video understanding. Most existing methods are two-stage: one stage generates person bounding boxes and the other recognizes actions. Such two-stage methods, however, are generally inefficient. We observe that directly unifying detection and action recognition typically suffers from (i) inferior learning, because detection and action recognition require context representations with different properties, and (ii) optimization difficulty when training data is insufficient. In this work, we present a decoupled one-stage network, dubbed DOAD, that mitigates these issues and improves the efficiency of spatio-temporal action detection. To this end, we decouple detection and action recognition into two branches: one branch learns the detection representation for actor detection, and the other learns the representation for action recognition. For the action branch, we design a transformer-based module (TransPC) that models pairwise relationships between people and context. Unlike the vector-based dot product commonly used in self-attention, TransPC is built on a novel matrix-based key and value for Hadamard attention to model person-context information. It exploits not only the relationships between person pairs but also context and relative position information. Results on the AVA and UCF101-24 datasets show that our method is competitive with two-stage state-of-the-art methods while being significantly more efficient.
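
To make the contrast with dot-product attention concrete, the following is a minimal NumPy sketch of one way a Hadamard (element-wise) attention over pairwise person-context features could look. It is not the TransPC module itself: the abstract does not give its exact formulation, so the function name `hadamard_pair_attention`, the way pairwise features are assembled (neighbor feature, a pooled context vector, and box-center offsets), the projection matrices `w_q`, `w_k`, `w_v`, and the per-channel softmax are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hadamard_pair_attention(person, context, boxes, w_q, w_k, w_v):
    """Illustrative Hadamard attention over person pairs (hypothetical design).

    person:  (N, D) one feature vector per detected person
    context: (D,)   pooled scene/context feature (an assumption)
    boxes:   (N, 4) person boxes as (x1, y1, x2, y2)
    w_q:     (D, D) query projection
    w_k:     (2*D + 2, D) key projection for pairwise features
    w_v:     (2*D + 2, D) value projection for pairwise features
    """
    n, d = person.shape

    # Relative position: offset from the center of box i to the center of box j.
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    rel = centers[None, :, :] - centers[:, None, :]            # (N, N, 2)

    # Pairwise person-context feature for every (i, j) pair.
    neighbor = np.broadcast_to(person[None, :, :], (n, n, d))  # feature of person j
    ctx = np.broadcast_to(context, (n, n, d))                  # shared context
    pair = np.concatenate([neighbor, ctx, rel], axis=-1)       # (N, N, 2D + 2)

    # Matrix-shaped key and value: one (N, D) matrix per query person.
    key = pair @ w_k                                           # (N, N, D)
    value = pair @ w_v                                         # (N, N, D)
    query = person @ w_q                                       # (N, D)

    # Hadamard scores: element-wise product instead of a dot product,
    # so every channel keeps its own attention weight over neighbors j.
    scores = query[:, None, :] * key                           # (N, N, D)
    attn = softmax(scores, axis=1)                             # normalize over j
    out = (attn * value).sum(axis=1)                           # (N, D)
    return out, attn
```

In this sketch the attention weights have shape `(N, N, D)` rather than the `(N, N)` of standard self-attention, because the element-wise product is never summed over channels; that is the only property it is meant to illustrate.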
