Developing robots that are capable of many skills and able to generalize to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields like computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robots. In this work, we propose a framework to bridge this gap and better scale up robot learning, through the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement in training efficiency gained by using pretrained out-of-domain visual representations at the compression stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy that can perform 10 manipulation tasks involving kitchen objects and is robust to varying layouts of distractor objects; and 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and the augmented datasets in both real and simulated environments will be released to facilitate future research.