Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks with specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any-modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, we present a Decoupled Mixture-of-Expert (DeMoE) mechanism that decouples unified representation learning into the modeling of cross-modal shared knowledge and modality-specific information, enabling the model to remain flexible while improving generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline that consolidates all task outputs into a single set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
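The abstract describes DeMoE only at a high level: one shared pathway captures cross-modal common knowledge while modality-specific pathways capture modality-dependent cues. The following is a minimal, hypothetical sketch of that decoupling idea in NumPy; the expert design, modality list, and combination rule are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (hypothetical)

def make_expert(d, rng):
    # A single expert: one random linear map followed by ReLU (illustrative only).
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda x: np.maximum(x @ W, 0.0)

# Decoupled experts: the shared expert models cross-modal shared knowledge;
# one specific expert per input modality models modality-specific information.
shared = make_expert(D, rng)
specific = {m: make_expert(D, rng) for m in ["rgb", "depth", "thermal", "event"]}

def demoe_layer(x, modality):
    # Every token passes through the shared expert plus the expert
    # matching its input modality; outputs are summed.
    return shared(x) + specific[modality](x)

x = rng.standard_normal((4, D))   # 4 tokens from an RGB frame
y = demoe_layer(x, "rgb")
assert y.shape == x.shape
```

Under this sketch, adding a new modality only requires adding one specific expert, while the shared expert continues to be trained by all modalities, which is one plausible way to read the claimed flexibility/generalization trade-off.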

