Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted with LoRA fine-tuning. We propose a framework with two conditioning modes: (1) text-conditioned generation, which takes a language instruction and the first frame, and (2) trajectory-conditioned generation, which takes a 2D trajectory overlaid on the same initial frame. Experiments on the Jaco Play, Bridge V2, and RT-1 datasets show that both modes produce smooth, coherent robot manipulation videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at \href{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}.
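To make the setup concrete, the sketch below shows one way the text-conditioned mode could be implemented with off-the-shelf tooling (Hugging Face diffusers and peft): a frozen text-to-image diffusion model is LoRA-adapted so that each generated image is a grid whose cells are successive frames of a manipulation clip. The base checkpoint, grid layout, LoRA rank, target modules, and learning rate are illustrative assumptions, not our released configuration; the repository linked above contains the actual code.

\begin{verbatim}
# Minimal sketch (assumed configuration) of the text-conditioned mode:
# LoRA-adapt a pretrained text-to-image diffusion model so that each
# generated image is a grid whose cells are successive clip frames.
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)   # assumed base model

# Freeze everything, then attach low-rank adapters to the UNet's
# attention projections; only these adapter weights are trained.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.unet.requires_grad_(False)
pipe.unet.add_adapter(LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")
optimizer = torch.optim.AdamW(
    [p for p in pipe.unet.parameters() if p.requires_grad], lr=1e-4)

def training_step(grid_image, instructions):
    # grid_image: (B, 3, H, W) tensor in [-1, 1], frames of one clip
    # tiled into a single image; the observed first frame is assumed
    # to occupy the first grid cell.
    # instructions: list of language commands, e.g. "pick up the cup".
    latents = pipe.vae.encode(grid_image.to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Standard denoising objective: add noise at a random timestep and
    # train the LoRA-adapted UNet to predict it back.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    tokens = pipe.tokenizer(instructions, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True,
                            return_tensors="pt").input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = torch.nn.functional.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
\end{verbatim}

The trajectory-conditioned mode would follow the same recipe, with the 2D trajectory rasterized onto the initial frame before it is placed in the grid; this injection mechanism is likewise an assumption of the sketch rather than a description of the released implementation.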