Recent video generation models can produce high-fidelity, temporally coherent videos, suggesting that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to investigate this question comprehensively, focusing on the leading, widely used Veo-3 model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth assessment of Chain-of-Frame (CoF) reasoning. Our findings show that while current video models exhibit promising reasoning patterns for short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
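The evaluation protocol summarized above can be pictured as a simple loop: prompt the video model zero-shot, treat the generated frame sequence as a chain of reasoning steps, and score the outcome per dimension. The sketch below is a minimal, hypothetical harness illustrating that loop; the names `generate_video`, `score_frames`, `CoFCase`, and the dimension list are illustrative stand-ins and are not the MME-CoF benchmark code or the Veo-3 API.

```python
# Minimal, hypothetical Chain-of-Frame (CoF) evaluation harness.
# All names below (generate_video, score_frames, DIMENSIONS) are illustrative
# stand-ins; they are NOT the MME-CoF benchmark code or the Veo-3 API.
from dataclasses import dataclass
from statistics import mean

# The abstract names spatial, geometric, physical, temporal, and embodied logic
# among its 12 dimensions; this subset is a placeholder.
DIMENSIONS = ["spatial", "geometric", "physical", "temporal", "embodied"]

@dataclass
class CoFCase:
    dimension: str   # which reasoning dimension this prompt probes
    prompt: str      # text (and optionally image) prompt given zero-shot

def generate_video(prompt: str, num_frames: int = 16) -> list[str]:
    """Stub for a video generator: returns frame identifiers only."""
    return [f"{prompt[:16]}-frame-{i}" for i in range(num_frames)]

def score_frames(frames: list[str], case: CoFCase) -> float:
    """Stub verifier: a real judge would check whether the frame sequence
    carries out the required reasoning (e.g., a traced maze path stays on
    valid cells across frames)."""
    return 0.0  # placeholder score in [0, 1]

def evaluate(cases: list[CoFCase]) -> dict[str, float]:
    """Average CoF score per reasoning dimension."""
    per_dim: dict[str, list[float]] = {d: [] for d in DIMENSIONS}
    for case in cases:
        frames = generate_video(case.prompt)  # the "chain of frames"
        per_dim[case.dimension].append(score_frames(frames, case))
    return {d: mean(s) if s else float("nan") for d, s in per_dim.items()}

if __name__ == "__main__":
    demo = [CoFCase("spatial", "Trace the shortest path through the maze.")]
    print(evaluate(demo))
```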