Vision-language-action (VLA) models can enable broad open-world generalization, but they require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs on human videos alone, and establishing a mapping between human and robot embodiments requires manual engineering and remains a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficiently diverse scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pre-training produces embodiment-agnostic representations of human and robot data. We validate these findings through a series of experiments probing human-to-robot skill transfer and find that, with sufficiently diverse robot pre-training, our method can nearly double performance on generalization settings seen only in the human data.
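The abstract mentions a simple co-training recipe for mixing human video data with robot data but does not specify its details. The following is a minimal sketch of one plausible data-mixing scheme, assuming each source yields training batches; the function name `make_cotraining_sampler`, the `human_fraction` parameter, and the toy datasets are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of human/robot co-training data mixing (illustrative only).
# Assumes both sources yield pre-formed training batches; the mixing ratio
# is a hypothetical hyperparameter, not taken from the paper.
import random


def make_cotraining_sampler(robot_batches, human_batches, human_fraction=0.3, seed=0):
    """Yield batches, drawing from human video data with probability
    `human_fraction` and from robot data otherwise.

    Stops when either source is exhausted; a real recipe would likely
    cycle or reshuffle the smaller dataset instead.
    """
    rng = random.Random(seed)
    robot_it, human_it = iter(robot_batches), iter(human_batches)
    while True:
        source = human_it if rng.random() < human_fraction else robot_it
        try:
            yield next(source)
        except StopIteration:
            return


# Toy stand-ins for the two data sources, to show the interleaving.
robot_data = [{"embodiment": "robot", "idx": i} for i in range(5)]
human_data = [{"embodiment": "human", "idx": i} for i in range(5)]
for batch in make_cotraining_sampler(robot_data, human_data, human_fraction=0.3):
    print(batch["embodiment"], batch["idx"])
```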