Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across diverse video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework, in which each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters on the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
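To make the skill-taxonomy step concrete, below is a minimal sketch of how extracted skill phrases could be clustered into a shared taxonomy. The abstract does not specify the embedding model or clustering algorithm; the sentence encoder, k-means, the cluster count, and the example skill phrases here are all illustrative assumptions.

```python
# Minimal sketch of the skill-clustering step, under assumed choices:
# skills are short free-text phrases (e.g., produced per question by an
# LLM), embedded with a generic sentence encoder and grouped by k-means.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical skill phrases extracted from training questions.
skills = [
    "detect the onset of an event",
    "judge left/right spatial relations",
    "infer a character's emotion from facial cues",
    "track object motion across frames",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
embeddings = encoder.encode(skills)                # (num_skills, dim)

n_clusters = 2  # taxonomy size is a tunable choice, not from the paper
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

# Group phrases by cluster id to form the shared skill taxonomy.
taxonomy = {c: [s for s, k in zip(skills, cluster_ids) if k == c]
            for c in range(n_clusters)}
print(taxonomy)
```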
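The expert-learning side can likewise be sketched. The abstract says only that each expert specializes in a subset of skills and is trained with lightweight adapters; the low-rank (LoRA-style) per-skill adapter over a frozen base layer and the hard routing by skill id below are assumptions for illustration, not the paper's specified architecture.

```python
# Minimal sketch of skill-specific expert adapters: a frozen base
# projection plus one low-rank adapter per skill cluster, routed by a
# skill id. All architectural details here are assumed.
import torch
import torch.nn as nn

class SkillExpertLinear(nn.Module):
    """Frozen base linear layer with one low-rank adapter per skill."""
    def __init__(self, dim: int, num_skills: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # base model stays frozen
        self.base.bias.requires_grad_(False)
        # Per-skill low-rank factors A (dim x rank) and B (rank x dim);
        # B starts at zero so each adapter initially acts as identity.
        self.A = nn.Parameter(torch.randn(num_skills, dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_skills, rank, dim))

    def forward(self, x: torch.Tensor, skill_id: int) -> torch.Tensor:
        # Route the input through the adapter for its skill cluster.
        delta = x @ self.A[skill_id] @ self.B[skill_id]
        return self.base(x) + delta

layer = SkillExpertLinear(dim=64, num_skills=3)
x = torch.randn(2, 64)      # a toy batch of features
out = layer(x, skill_id=1)  # expert for skill cluster 1
print(out.shape)            # torch.Size([2, 64])
```

In this sketch, only the per-skill factors receive gradients, which matches the abstract's emphasis on lightweight, adapter-based specialization while the shared backbone remains fixed.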