Zero-shot video classification for fine-grained activity recognition has largely been explored using methods similar to its image-based counterpart, namely by defining image-derived attributes that serve to discriminate among classes. However, such methods do not capture the fundamental dynamics of activities and are thus limited to cases where static image content alone suffices to classify an activity. For example, reversible actions such as entering and exiting a car are often indistinguishable. In this work, we present a framework for straightforward modeling of activities as a state machine of dynamic attributes. We show that encoding the temporal structure of attributes greatly increases our modeling power, allowing us to capture action direction, for example. Further, we extend this to activity detection using dynamic programming, providing, to our knowledge, the first example of zero-shot joint segmentation and classification of complex action sequences in a longer video. We evaluate our method on the Olympic Sports dataset, where our model establishes a new state of the art for standard zero-shot-learning (ZSL) evaluation and outperforms all other models in the inductive category for generalized zero-shot (GZSL) evaluation. Additionally, we are the first to demonstrate zero-shot decoding of complex action sequences on a widely used surgical dataset. Lastly, we show that we can even eliminate the need to train attribute detectors by using off-the-shelf object detectors to recognize activities in challenging surveillance videos.
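The joint segmentation and classification described above can be illustrated with a minimal sketch. It is not the paper's implementation; we assume an activity is an ordered list of attribute states and that each frame yields per-attribute scores, then align frames to states with a Viterbi-style dynamic program (each state covers a contiguous segment, states advance monotonically). The function name and score format are illustrative assumptions.

```python
import numpy as np

def decode_state_sequence(frame_scores, state_order):
    """Align T frames to an ordered attribute-state machine via DP.

    frame_scores: (T, A) array of per-frame attribute scores (assumed given).
    state_order:  list of attribute indices defining the activity's states.
    Returns the best alignment score and the per-frame attribute labels.
    """
    T, _ = frame_scores.shape
    S = len(state_order)
    dp = np.full((T, S), -np.inf)          # best score ending at (frame t, state s)
    back = np.zeros((T, S), dtype=int)     # backpointers for the backtrace
    dp[0, 0] = frame_scores[0, state_order[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                          # remain in state s
            adv = dp[t - 1, s - 1] if s > 0 else -np.inf  # advance from s-1
            if stay >= adv:
                dp[t, s], back[t, s] = stay, s
            else:
                dp[t, s], back[t, s] = adv, s - 1
            dp[t, s] += frame_scores[t, state_order[s]]
    # Backtrace from the final state to recover the segmentation.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return dp[T - 1, S - 1], [state_order[s] for s in path]
```

Because the alignment is directional, reversing the state order (e.g. "exit car" vs. "enter car") yields a different score, which is precisely the temporal structure that static-attribute methods discard.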