Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from a paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, which prevents the direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset, which provides detailed action descriptions written specifically for surgical robots. We then built SurgWorld on top of the most advanced physical AI world model and SATA. It can generate diverse, generalizable, and realistic surgical videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate on a real surgical robot platform that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
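The data flow summarized in the abstract (synthetic videos from a world model, pseudo-kinematics from an inverse dynamics model, then a mix of real and synthetic pairs for policy training) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Sample`, `pseudo_label`, `toy_idm`, and `build_training_set` are hypothetical names, frames are stood in for by short tuples of floats, and the placeholder inverse dynamics model simply takes frame-to-frame differences. It is not the SurgWorld implementation and says nothing about the actual model architectures.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[float, ...]   # hypothetical stand-in for an image or its features
Action = Tuple[float, ...]  # hypothetical stand-in for a kinematic command


@dataclass
class Sample:
    frame: Frame
    action: Action
    synthetic: bool  # True when the action is a pseudo-label, not a recording


def pseudo_label(video: Sequence[Frame],
                 idm: Callable[[Frame, Frame], Action]) -> List[Sample]:
    """Pair each frame with the action the IDM infers from it and its successor."""
    return [Sample(a, idm(a, b), True) for a, b in zip(video, video[1:])]


def toy_idm(prev: Frame, nxt: Frame) -> Action:
    """Placeholder IDM (assumption): the frame difference is taken as the action."""
    return tuple(n - p for p, n in zip(prev, nxt))


def build_training_set(real: List[Sample],
                       synthetic_videos: Sequence[Sequence[Frame]],
                       idm: Callable[[Frame, Frame], Action]) -> List[Sample]:
    """Concatenate real demonstrations with pseudo-labeled synthetic videos."""
    data = list(real)
    for video in synthetic_videos:
        data.extend(pseudo_label(video, idm))
    return data


# Usage: one real demonstration step plus one three-frame synthetic video.
real = [Sample((0.0, 0.0), (1.0, 0.0), False)]
videos = [[(0.0, 0.0), (1.0, 2.0), (3.0, 2.0)]]
data = build_training_set(real, videos, toy_idm)
```

A video of N frames yields N - 1 pseudo-labeled pairs, since the last frame has no successor from which to infer an action. Keeping the `synthetic` flag on each sample makes it possible to weight or filter pseudo-labeled data separately from real demonstrations during training.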