World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision training and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.