Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning over latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning with pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/ .


