We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatio-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D contact forces into a camera-aligned format suitable for vision backbones. At inference time, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results show that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, and that the VLM-based judge distinguishes collision from non-collision trajectories.
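To make the training objective concrete, the sketch below shows one MaskGIT-style masked-prediction step, assuming the video frames, contact splats, and joint states have already been discretized into a single token sequence per chunk. The `model` interface, `mask_token_id`, and the per-sequence cosine masking ratio follow the generic MaskGIT recipe rather than ChronoDreamer's exact setup; conditioning on the action and history context is omitted, and none of these names come from the paper.

```python
import torch
import torch.nn.functional as F

def masked_prediction_step(model, tokens, mask_token_id, vocab_size):
    """One MaskGIT-style training step on a (B, L) batch of token ids.

    A masking ratio is drawn per sequence from the cosine schedule
    r = cos(pi/2 * u) with u ~ U(0, 1), the selected positions are
    replaced by a [MASK] token, and the transformer is trained to
    recover the original ids at exactly those positions.
    """
    B, L = tokens.shape
    u = torch.rand(B, device=tokens.device)
    ratio = torch.cos(0.5 * torch.pi * u)            # masking ratio in (0, 1]
    mask = torch.rand(B, L, device=tokens.device) < ratio[:, None]

    inputs = torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)
    logits = model(inputs)                           # (B, L, vocab_size)

    # Cross-entropy only over the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```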
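The contact encoding can likewise be sketched in a few lines, assuming a pinhole camera model. The function name, the 1/z amplitude falloff, and the fixed splat width `sigma_px` are illustrative assumptions standing in for whatever depth weighting ChronoDreamer actually uses.

```python
import numpy as np

def render_contact_splats(points_cam, forces, K, img_hw, sigma_px=4.0):
    """Render 3D contact forces as a depth-weighted Gaussian splat image.

    points_cam: (N, 3) contact points in camera coordinates (z > 0).
    forces:     (N,) contact force magnitudes.
    K:          (3, 3) pinhole intrinsics.
    img_hw:     (H, W) output resolution, aligned with the RGB frame.
    """
    H, W = img_hw
    splat = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)   # pixel grid

    for (x, y, z), f in zip(points_cam, forces):
        if z <= 0:                                   # behind the camera
            continue
        u = K[0, 0] * x / z + K[0, 2]                # pinhole projection
        v = K[1, 1] * y / z + K[1, 2]
        amp = f / z                                  # assumed 1/z depth weight
        d2 = (xs - u) ** 2 + (ys - v) ** 2
        splat += amp * np.exp(-d2 / (2.0 * sigma_px ** 2))

    if splat.max() > 0:                              # normalize to [0, 1]
        splat /= splat.max()
    return splat
```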
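Finally, the inference-time safety filter amounts to a simple rejection loop over candidate actions. The sketch below is schematic: `world_model.rollout`, `vlm_judge.collision_likelihood`, and the 0.2 threshold are hypothetical stand-ins for interfaces the abstract does not specify.

```python
def select_safe_action(world_model, vlm_judge, history, candidates,
                       collision_threshold=0.2):
    """Return the first candidate action whose imagined rollout the
    vision-language judge scores as unlikely to collide, or None if
    every candidate is rejected and the caller should resample."""
    for action in candidates:
        rollout = world_model.rollout(history, action)       # predicted frames
        p_collision = vlm_judge.collision_likelihood(rollout)
        if p_collision < collision_threshold:
            return action
    return None
```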