The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning: real-world data acquisition is costly, and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearance without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on renderings of the robot's motion, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data yields consistent improvements in downstream policy learning, with relative gains of 36.4% on simulation benchmarks and nearly doubled performance in real-world studies. These results suggest that grounding generative world models in robot motion offers a practical path toward scaling imitation learning.
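To make the conditioning idea concrete, the sketch below illustrates one common way a video diffusion denoiser can be conditioned on a rendering of the robot's motion, namely channel-wise concatenation of the rendered frames with the noisy video latents. This is a minimal illustration under assumed details, not the paper's implementation: the class name, tensor shapes, channel counts, and the concatenation scheme are all assumptions, and timestep/text conditioning is omitted for brevity.

```python
# Minimal sketch (assumptions, not the authors' code): condition a toy video
# diffusion denoiser on robot motion renderings via channel-wise concatenation,
# so the robot embodiment is fixed by the condition while the denoiser fills in
# objects and environment.
import torch
import torch.nn as nn


class MotionConditionedDenoiser(nn.Module):
    """Toy denoiser that sees both the noisy video latents and a rendering of
    the robot's motion. Hyperparameters here are illustrative only."""

    def __init__(self, latent_channels: int = 4, cond_channels: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_channels + cond_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latents: torch.Tensor, robot_render: torch.Tensor) -> torch.Tensor:
        # noisy_latents: (B, C_lat, T, H, W); robot_render: (B, C_cond, T, H, W)
        x = torch.cat([noisy_latents, robot_render], dim=1)  # inject embodiment condition
        return self.net(x)  # predicted noise (or denoised latents)


if __name__ == "__main__":
    denoiser = MotionConditionedDenoiser()
    latents = torch.randn(1, 4, 8, 32, 32)   # noisy video latents
    render = torch.randn(1, 3, 8, 32, 32)    # rendered robot-motion frames
    print(denoiser(latents, render).shape)   # torch.Size([1, 4, 8, 32, 32])
```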