Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study of this question, focusing on the leading and popular Veo-3 model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, and systematically characterize both its strengths and its failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

