Temporal action detection (TAD) aims to determine the semantic label and the temporal boundaries of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding, on which significant progress has been made. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. In this paper, we propose an end-to-end Transformer-based framework for TAD, termed \textit{TadTR}, which maps a set of learnable embeddings to action instances in parallel. TadTR adaptively extracts the temporal context required for making action predictions by selectively attending to a sparse set of snippets in a video. As a result, it simplifies the TAD pipeline and incurs a lower computation cost than previous detectors, while preserving remarkable detection performance. TadTR achieves state-of-the-art performance on HACS Segments (+3.35% average mAP). As a single-network detector, TadTR runs 10$\times$ faster than its comparable competitor. It outperforms existing single-network detectors by a large margin on THUMOS14 (+5.0% average mAP) and ActivityNet (+7.53% average mAP). When combined with other detectors, it reports 54.1% mAP at IoU=0.5 on THUMOS14, and 34.55% average mAP on ActivityNet-1.3. Our code will be released at \url{https://github.com/xlliu7/TadTR}.
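The sparse, query-based decoding described above can be illustrated with a toy NumPy sketch. Note that all weights here are random stand-ins for learned parameters, and the variable names, dimensions, and single-step structure are hypothetical simplifications, not the paper's actual implementation: each learnable query attends to only a small set of sampled snippets (deformable-style attention) and is then mapped to a candidate action segment.

```python
import numpy as np

rng = np.random.default_rng(0)

T, C = 100, 16          # number of video snippets, feature dimension (toy sizes)
num_queries, K = 8, 4   # learnable action queries, sampled snippets per query

snippet_feats = rng.standard_normal((T, C))      # toy snippet features
queries = rng.standard_normal((num_queries, C))  # learnable action embeddings

# Random stand-ins for learned projection weights (hypothetical names)
W_loc = rng.standard_normal((C, K))   # predicts sampling offsets per query
W_att = rng.standard_normal((C, K))   # predicts attention weights per sample
W_seg = rng.standard_normal((C, 2))   # predicts a (center, width) segment

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each query attends to a sparse set of K snippets around a reference point,
# rather than to all T snippets (the source of the efficiency gain).
ref = np.linspace(0, 1, num_queries)[:, None]   # reference points in [0, 1]
offsets = np.tanh(queries @ W_loc) * 0.1        # small learned offsets
locs = np.clip(ref + offsets, 0, 1)             # normalized sampling locations
idx = np.round(locs * (T - 1)).astype(int)      # nearest snippet indices
sampled = snippet_feats[idx]                    # (num_queries, K, C)
att = softmax(queries @ W_att)                  # (num_queries, K)
context = (att[..., None] * sampled).sum(axis=1)  # aggregated sparse context

# Map each context-updated query to a normalized (center, width) segment.
segments = 1.0 / (1.0 + np.exp(-(queries + context) @ W_seg))
print(segments.shape)  # (8, 2): one candidate segment per query, in parallel
```

The key point the sketch conveys is that predictions for all queries are produced in parallel from a single network pass, with no external proposal generation or hand-designed post-processing stages.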