Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which are challenging to scale. To mitigate this issue, we propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing to obtain meaningful data for robot learning without requiring additional robot data. We term our method Robot Learning with Semantically Imagined Experience (ROSIE). Specifically, we make use of state-of-the-art text-to-image diffusion models and perform aggressive data augmentation on top of our existing robotic manipulation datasets by inpainting various unseen objects for manipulation, backgrounds, and distractors with text guidance. Through extensive real-world experiments, we show that manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly with respect to novel distractors. In addition, we find that we can improve the robustness and generalization of high-level robot learning tasks such as success detection through training with the diffusion-based data augmentation. The project's website and videos can be found at diffusion-rosie.github.io
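The core augmentation step described above, repainting only a masked region of an existing robot frame under a text prompt, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `inpaint` callable is a stand-in for any text-guided inpainting diffusion model, and the mask/compositing logic simply ensures that pixels outside the edited region (the robot arm, task-relevant objects, and hence the action labels) are preserved exactly.

```python
# Hypothetical sketch of diffusion-based data augmentation for robot frames.
# `inpaint(frame, mask, prompt)` is assumed to be a text-guided inpainting
# model returning a full image of the same shape; it is NOT a real API here.
import numpy as np

def make_region_mask(h, w, box):
    """Binary mask that is 1 inside the target box (y0, x0, y1, x1)."""
    mask = np.zeros((h, w), dtype=np.uint8)
    y0, x0, y1, x1 = box
    mask[y0:y1, x0:x1] = 1
    return mask

def augment_frame(frame, mask, prompt, inpaint):
    """Replace only the masked pixels with the model's text-guided output."""
    edited = inpaint(frame, mask, prompt)
    # Composite: keep original pixels outside the mask so everything the
    # policy's action labels depend on stays untouched.
    m = mask[..., None].astype(frame.dtype)
    return frame * (1 - m) + edited * m
```

In this sketch, varying the prompt (e.g. a novel object name or a new background description) over the same episode yields many semantically distinct training images that still share the original actions.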