Temporal action localization is an important step towards video understanding. Most current action localization methods rely on untrimmed videos with full temporal annotations of action instances. However, annotating both the action labels and the temporal boundaries of videos is expensive and time-consuming. To this end, we propose a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training. We propose a classification module that generates action labels for each segment in the video, and a deep metric learning module that learns the similarity between different action instances. We jointly optimize a balanced binary cross-entropy loss and a metric loss using standard backpropagation. Extensive experiments demonstrate the effectiveness of both components for temporal localization. We evaluate our algorithm on two challenging untrimmed video datasets: THUMOS14 and ActivityNet1.2. Our approach improves the current state-of-the-art result on THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance on ActivityNet1.2.
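To make the joint objective concrete, below is a minimal PyTorch-style sketch of a balanced binary cross-entropy loss over per-segment class scores combined with a metric loss over action-instance embeddings. The inverse-frequency balancing scheme, the triplet form of the metric loss, and the trade-off weight `lambda_m` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def balanced_bce_loss(scores, labels):
    """Balanced binary cross-entropy over per-segment class scores.

    scores: (N, C) sigmoid probabilities for N segments and C action classes.
    labels: (N, C) float binary targets derived from video-level labels.
    Positive and negative terms are weighted by inverse frequency
    (an assumed balancing scheme) so neither class dominates.
    """
    pos = labels.sum().clamp(min=1.0)
    neg = (1.0 - labels).sum().clamp(min=1.0)
    loss = -(labels * torch.log(scores + 1e-8) / pos
             + (1.0 - labels) * torch.log(1.0 - scores + 1e-8) / neg)
    return loss.sum()

def metric_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style metric loss (a stand-in formulation): pull embeddings
    of the same action class together, push different classes apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def joint_loss(scores, labels, anchor, positive, negative, lambda_m=0.5):
    """Joint objective optimized by standard backpropagation;
    lambda_m is a hypothetical weighting hyperparameter."""
    return (balanced_bce_loss(scores, labels)
            + lambda_m * metric_loss(anchor, positive, negative))
```

Because both terms are differentiable, the classification and metric learning modules can be trained end-to-end with any standard optimizer.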