Affordance-First Decomposition for Continual Learning in Video-Language Understanding

Continual learning for video-language understanding is increasingly important as models face non-stationary data, domains, and query styles. Yet prevailing solutions blur what should stay stable and what should adapt, rely on static routing and capacity, or require replaying past videos. We aim to specify explicitly where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay, so no past videos are stored. AFD achieves state-of-the-art results across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA; ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ); and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable, interaction-centered substrate and targeted adaptation.

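
The abstract reports average accuracy together with a signed forgetting value, but does not state how either is computed. As a minimal sketch, assuming the common continual-learning convention in which `acc[i][j]` is the accuracy on task `j` after training through task `i`, the two quantities could be computed as below. The function names, the sign convention (negative means accuracy dropped), and the numbers in the example are all hypothetical and are not taken from the paper.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks, measured after training on the last task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def signed_forgetting(acc: List[List[float]]) -> float:
    """Mean change in accuracy on each earlier task between the moment it was
    learned and the end of training (assumed convention: negative = forgot)."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    deltas = [acc[num_tasks - 1][j] - acc[j][j] for j in range(num_tasks - 1)]
    return sum(deltas) / len(deltas)


# Hypothetical accuracy matrix for three sequential tasks (values in percent).
# Entries above the diagonal are unused: the task has not been seen yet.
acc = [
    [50.0, 0.0, 0.0],
    [49.0, 52.0, 0.0],
    [48.0, 51.0, 55.0],
]

avg = average_accuracy(acc)
fgt = signed_forgetting(acc)
print(f"average accuracy: {avg:.2f}%, forgetting: {fgt:.2f}%")
```

Under this assumed convention, a value such as -1.8% would mean that accuracy on earlier tasks fell by 1.8 points on average by the end of training.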