Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance, rather than the underlying 3D scene structure, leads to low training efficiency and poor generalization. To address these challenges, we introduce \emph{Implicit Scene Supervision (ISS) Policy}, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous actions from point cloud observations. We extend DiT with a novel implicit scene supervision module that encourages the model to produce outputs consistent with the scene's geometric evolution, thereby improving the performance and robustness of the policy. Notably, ISS Policy achieves state-of-the-art performance on both single-arm manipulation (MetaWorld) and dexterous hand manipulation (Adroit) benchmarks. In real-world experiments, it also demonstrates strong generalization and robustness. Additional ablation studies show that our method scales effectively with both data and model parameters. Code and videos will be released.
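To make the training objective concrete, the sketch below shows one way such a policy could combine a standard denoising-diffusion loss on action chunks with an auxiliary scene-supervision term that asks the point-cloud features to also predict the scene's geometric evolution. This is a minimal illustration under assumed shapes and a toy encoder, not the paper's architecture; all names (`encode_point_cloud`, `training_losses`, the weight matrices, and the loss weight `lam`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes (not from the paper): 512-point cloud, 16-step action chunk.
N_POINTS, ACT_HORIZON, ACT_DIM, FEAT = 512, 16, 7, 64

def encode_point_cloud(pc, W_enc):
    """Toy permutation-invariant encoder: per-point linear map, ReLU, max-pool."""
    return np.maximum(pc @ W_enc, 0.0).max(axis=0)            # (FEAT,)

def training_losses(pc_t, actions, pc_next, params, lam=0.1):
    """One training step's losses for a diffusion policy with an auxiliary
    implicit-scene-supervision term (a sketch, not the exact formulation)."""
    W_enc, W_eps, W_scene = params
    cond = encode_point_cloud(pc_t, W_enc)

    # Standard denoising objective: corrupt the action chunk, predict the noise.
    t = rng.uniform(0.0, 1.0)
    eps = rng.standard_normal(actions.shape)
    noisy = np.sqrt(1.0 - t) * actions + np.sqrt(t) * eps
    inp = np.concatenate([noisy.ravel(), cond])
    eps_hat = (inp @ W_eps).reshape(actions.shape)
    diffusion_loss = np.mean((eps_hat - eps) ** 2)

    # Implicit scene supervision: the same conditioning features must also
    # predict how the scene geometry evolves (here, the next frame's feature).
    scene_hat = cond @ W_scene
    scene_target = encode_point_cloud(pc_next, W_enc)
    scene_loss = np.mean((scene_hat - scene_target) ** 2)

    return diffusion_loss + lam * scene_loss, diffusion_loss, scene_loss

# Example usage with random data.
pc_t = rng.standard_normal((N_POINTS, 3))
pc_next = pc_t + 0.01 * rng.standard_normal((N_POINTS, 3))
actions = rng.standard_normal((ACT_HORIZON, ACT_DIM))
params = (
    0.1 * rng.standard_normal((3, FEAT)),
    0.1 * rng.standard_normal((ACT_HORIZON * ACT_DIM + FEAT, ACT_HORIZON * ACT_DIM)),
    0.1 * rng.standard_normal((FEAT, FEAT)),
)
total, diff_l, scene_l = training_losses(pc_t, actions, pc_next, params)
```

In a real implementation the linear maps would be a DiT denoiser and a learned scene-prediction head, but the structure of the combined objective is the same: the auxiliary term only shapes the shared representation and adds no cost at inference time.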