While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. Our approach starts from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves the data efficiency of embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors of video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
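To make the closed-loop composition of primitive-level rollouts concrete, the sketch below outlines how a VLM planner, the SGG heatmap module, and the short-horizon video model could interact. This is a minimal illustrative sketch, not the paper's implementation; all interface names (plan_primitives, predict_heatmaps, generate_clip, execute_clip, task_done) are hypothetical placeholders.

```python
from typing import Any, Callable, List

# Hypothetical type aliases for readability; the concrete representations
# (camera frames, heatmap tensors, generated clips) are assumptions.
Observation = Any   # current sensory observation, e.g. camera frame(s)
Heatmaps = Any      # start/goal spatial guidance from the SGG module
Video = Any         # fixed short-horizon generated clip

def run_pewm_closed_loop(
    task: str,
    observe: Callable[[], Observation],                             # read the current observation
    plan_primitives: Callable[[str, Observation], List[str]],       # VLM planner: task -> primitive instructions
    predict_heatmaps: Callable[[Observation, str], Heatmaps],       # SGG: start-goal heatmap guidance
    generate_clip: Callable[[Observation, str, Heatmaps], Video],   # PEWM: short-horizon video generation
    execute_clip: Callable[[Video], None],                          # low-level controller tracks the clip
    task_done: Callable[[Observation], bool],                       # termination check for the overall task
    max_rounds: int = 10,
) -> None:
    """Closed-loop composition of primitive-level rollouts (illustrative sketch)."""
    for _ in range(max_rounds):
        obs = observe()
        if task_done(obs):
            return
        # The VLM planner re-decomposes the task from the latest observation,
        # which is what closes the loop between planning and execution.
        for instruction in plan_primitives(task, obs):
            obs = observe()
            heatmaps = predict_heatmaps(obs, instruction)       # start-goal guidance for this primitive
            clip = generate_clip(obs, instruction, heatmaps)    # generation restricted to a fixed short horizon
            execute_clip(clip)                                  # act, then re-observe on the next primitive
```

Fixing the generation horizon per primitive keeps each step short enough that the observe/plan/generate/execute cycle above can run repeatedly, which is the intended mechanism behind flexible closed-loop control and composition over long tasks.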