Generative video modeling has emerged as a compelling tool for zero-shot reasoning about plausible physical interactions in open-world manipulation. Yet translating the human-led motions these models produce into the low-level actions demanded by robotic systems remains a challenge. We observe that, given an initial image and a task instruction, such models excel at synthesizing sensible object motions. We therefore introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating state changes from the actuators that realize them, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular ones. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.
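
The abstract's pipeline (generate a video, lift it to 3D object flow, then track that flow with low-level actions) can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the function names (`reconstruct_object_flow`, `rollout_object_points`, `tracking_cost`) are hypothetical, the 3D flow is fabricated rather than reconstructed from a generated video, and a toy point-displacement model with random-shooting optimization stands in for a real simulator and the trajectory optimization or reinforcement learning described above.

```python
# Conceptual sketch of the Dream2Flow idea as described in the abstract:
# (1) obtain per-point 3D object flow, (2) treat manipulation as tracking
# that flow via trajectory optimization. All names below are placeholders,
# not the paper's API; the dynamics are a stand-in toy model.
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_object_flow(num_steps: int, num_points: int) -> np.ndarray:
    """Placeholder for lifting a generated video into 3D object trajectories;
    here we fabricate a smooth target flow of shape (T+1, N, 3)."""
    start = rng.uniform(-0.1, 0.1, size=(1, num_points, 3))
    goal_shift = np.array([0.2, 0.0, 0.1])  # desired net object displacement
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None, None]
    return start + alphas * goal_shift

def rollout_object_points(actions: np.ndarray, init_points: np.ndarray) -> np.ndarray:
    """Toy dynamics: each 3D action displaces every object point equally.
    A real system would query a physics simulator or learned dynamics model."""
    points = [init_points]
    for a in actions:
        points.append(points[-1] + a)
    return np.stack(points[1:])  # (T, N, 3)

def tracking_cost(actions: np.ndarray, target_flow: np.ndarray) -> float:
    """Manipulation as trajectory tracking: mean distance between rolled-out
    and reconstructed object point trajectories."""
    rollout = rollout_object_points(actions, target_flow[0])
    return float(np.mean(np.linalg.norm(rollout - target_flow[1:], axis=-1)))

# Simple random-shooting trajectory optimization over action sequences.
T, N, n_samples = 16, 32, 256
target_flow = reconstruct_object_flow(T + 1, N)
candidates = rng.normal(0.0, 0.02, size=(n_samples, T, 3))
costs = np.array([tracking_cost(a, target_flow) for a in candidates])
best_actions = candidates[np.argmin(costs)]
print(f"best tracking cost: {costs.min():.4f}")
```

The sketch highlights the separation the abstract emphasizes: the objective is defined purely over object states (the 3D flow), while the actuator that realizes those state changes enters only through the rollout function, which can be swapped for any embodiment.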