Video event localization tasks include temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL). Existing methods tend to over-specialize in individual tasks, neglecting the fact that these different event types are equally important for a complete understanding of video content. In this work, we aim to develop a unified framework that solves TAL, SED, and AVEL together to facilitate holistic video understanding. This is challenging, however, since different tasks emphasize distinct event characteristics, and existing task-specific datasets differ substantially in size, domain, and duration; as a result, a naive multi-task strategy yields unsatisfactory results. To tackle this problem, we introduce UniAV, a Unified Audio-Visual perception network that effectively learns and shares mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder that derives generic representations at multiple temporal scales for videos from all tasks, while task-specific experts capture the knowledge unique to each task. Furthermore, instead of using separate prediction heads, we develop a novel unified language-aware classifier based on semantically aligned task prompts, enabling our model to flexibly localize instances across tasks, with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves performance superior or comparable to state-of-the-art task-specific methods on the ActivityNet 1.3, DESED, and UnAV-100 benchmarks.
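To make the language-aware classifier concrete, the sketch below illustrates one plausible realization of the idea described above: per-frame class logits are computed as cosine similarities between audio-visual frame features and text embeddings of task-specific prompts, so a single classifier serves all three tasks and can score unseen categories at test time. The prompt template, the use of CLIP's text encoder, and the feature dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the authors' code) of a unified language-aware
# classifier: class logits are similarities between frame features and
# text embeddings of semantically aligned task prompts.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP (assumed text encoder)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical prompts pairing a task description with a class name;
# the exact template is an assumption for illustration.
prompts = [
    "a video of temporal action localization, the action of playing basketball",
    "a video of sound event detection, the sound of a dog barking",
    "a video of audio-visual event localization, the event of playing guitar",
]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = F.normalize(text_emb.float(), dim=-1)  # (C, D)

# Stand-in for per-frame audio-visual features from the unified encoder;
# in practice these would be projected into the text embedding space.
T, D = 128, text_emb.shape[-1]
frame_feats = F.normalize(torch.randn(T, D, device=device), dim=-1)

# Per-frame class scores shared across tasks; novel categories only
# require encoding a new prompt, with no retraining of a prediction head.
logits = frame_feats @ text_emb.t()  # (T, C)
```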