In this paper, we propose Gated-ViGAT, an efficient approach for video event recognition that utilizes bottom-up (object) information, a new frame sampling policy, and a gating mechanism. Specifically, the frame sampling policy uses weighted in-degrees (WiDs), derived from the adjacency matrices of graph attention networks (GATs), together with a dissimilarity measure to select the most salient and, at the same time, diverse frames representing the event in the video. Additionally, the proposed gating mechanism fetches the selected frames sequentially and exits early once a sufficiently confident decision is reached. In this way, only a few frames are processed by the computationally expensive branch of our network, which is responsible for bottom-up information extraction. Experimental evaluation on two large, publicly available video datasets (MiniKinetics, ActivityNet) demonstrates that Gated-ViGAT achieves a large reduction in computational complexity compared to our previous approach (ViGAT), while maintaining excellent event recognition and explainability performance. The Gated-ViGAT source code is publicly available at https://github.com/bmezaris/Gated-ViGAT
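To make the two mechanisms described above concrete, the following is a minimal sketch, not the authors' implementation (see the repository above for that). It assumes a frame-level GAT attention/adjacency matrix `attn` of shape (T, T) and per-frame feature vectors `feats` of shape (T, D); the function names, the cosine dissimilarity measure, the running-mean aggregation, and the thresholds are all illustrative placeholders.

```python
import numpy as np

def select_frames(attn, feats, k, dissim_thresh=0.2):
    """WiD-based sampling sketch: rank frames by weighted in-degree, then keep
    only frames sufficiently dissimilar (here: cosine) to those already kept."""
    # Weighted in-degree of each frame node; the axis depends on whether
    # attn[j, i] or attn[i, j] encodes the edge j -> i.
    wids = attn.sum(axis=0)
    order = np.argsort(-wids)  # most salient frames first
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    selected = []
    for i in order:
        if selected:
            # Dissimilarity to the closest already-selected frame.
            sim = normed[selected] @ normed[i]
            if 1.0 - sim.max() < dissim_thresh:
                continue  # too similar to a frame already kept; skip it
        selected.append(int(i))
        if len(selected) == k:
            break
    return selected

def gated_inference(frames, expensive_branch, classifier, conf_thresh=0.9):
    """Early-exit gating sketch: feed the selected frames one at a time to the
    costly bottom-up branch and stop once the prediction is confident enough."""
    pooled = None
    for t, frame in enumerate(frames, start=1):
        obj_feat = expensive_branch(frame)  # bottom-up (object) information
        pooled = obj_feat if pooled is None else pooled + obj_feat
        probs = classifier(pooled / t)      # classify the running mean
        if probs.max() >= conf_thresh:      # confident enough: exit early
            return probs, t                 # t = frames actually processed
    return probs, t
```

Under these assumptions, only the `t` frames consumed before the confidence threshold is crossed ever reach the expensive branch, which is the source of the complexity reduction reported in the abstract.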