Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To address these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation that provides efficient and expressive supervision for VLA training. Built upon these components, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks. Experimental results also highlight its superior generalization and robustness in out-of-distribution scenarios. These findings establish METIS as a promising step toward a generalist model for dexterous manipulation.
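The abstract does not detail how the motion-aware dynamics are extracted. As a purely illustrative aid, the sketch below shows one common way a compact, discretized motion representation can be built: vector-quantizing per-step motion deltas into a small token vocabulary via k-means. All names (`fit_motion_codebook`, `tokenize_motion`, `n_tokens`, the 28-dimensional delta layout) are hypothetical and not taken from METIS.

```python
# Hypothetical sketch: discretizing continuous motion into tokens via k-means.
# This is NOT the METIS implementation; it only illustrates how a "compact,
# discretized motion representation" could serve as supervision for a VLA.
import numpy as np
from sklearn.cluster import KMeans


def fit_motion_codebook(motion_deltas: np.ndarray, n_tokens: int = 256) -> KMeans:
    """Fit a discrete codebook over per-step motion deltas.

    motion_deltas: (N, D) array, e.g. wrist pose + finger joint changes per step,
    pooled from human and robot trajectories in a shared action space.
    """
    return KMeans(n_clusters=n_tokens, n_init=10, random_state=0).fit(motion_deltas)


def tokenize_motion(codebook: KMeans, motion_deltas: np.ndarray) -> np.ndarray:
    """Map continuous motion deltas to discrete token ids (training targets)."""
    return codebook.predict(motion_deltas)


if __name__ == "__main__":
    # Synthetic data standing in for unified multi-source trajectories.
    rng = np.random.default_rng(0)
    deltas = rng.normal(size=(10_000, 28))           # e.g. 6-DoF wrist + 22 hand joints
    codebook = fit_motion_codebook(deltas, n_tokens=256)
    tokens = tokenize_motion(codebook, deltas[:16])  # ids in [0, 255] a VLA can predict
    print(tokens)
```

In such a scheme, the model predicts discrete motion tokens rather than raw continuous actions, which is one way a "compact and discretized" representation can make supervision more efficient; the actual METIS formulation may differ.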