Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted via LoRA finetuning. We propose a framework with two generation modes: (1) text-conditioned generation, which takes a language instruction and the first frame, and (2) trajectory-conditioned generation, which takes a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play, Bridge V2, and RT-1 datasets show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at \href{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}.


