WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, whether modular or end-to-end, fall short in manipulation-aware locomotion. This confines the robot to a limited workspace and prevents it from performing large-space loco-manipulation. We attribute this to two factors: (1) the difficulty of acquiring loco-manipulation knowledge, given the scarcity of humanoid teleoperation data, and (2) the difficulty of executing locomotion commands faithfully and reliably, given the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost, action-free egocentric videos. We also devise an efficient human data collection pipeline to augment the dataset and scale these benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of the first frameworks of its kind to enable large-space humanoid loco-manipulation. We verify it through comprehensive experiments on the AgiBot X2 humanoid, where it outperforms the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.
