Vision-Language-Action (VLA) models have recently emerged and demonstrate strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that demand explicitly defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still struggle to coordinate high-level planning with precise manipulation. We therefore aim to endow a VLA model with the capability to infer the "how" from the "what", transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture that enables coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building on these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert: each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting that automatically generates manual data for training the planning expert. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical state-of-the-art (SOTA) baseline on LEGO assembly and object rearrangement tasks.
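To make the two-expert ManualCoT flow concrete, below is a minimal, self-contained PyTorch sketch of the control pattern the abstract describes: a planning expert emits per-step manual conditions (here, position prompts plus latent step representations), and an action expert consumes each step as an explicit condition alongside implicit latent guidance. All names (PlanningExpert, ActionExpert), dimensions, and tensor shapes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PlanningExpert(nn.Module):
    """Hypothetical planning expert: maps goal-state tokens to a fixed
    number of manual steps, each with a position prompt and a latent."""
    def __init__(self, d_model=256, n_steps=4):
        super().__init__()
        self.n_steps = n_steps
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.step_queries = nn.Parameter(torch.randn(n_steps, d_model))
        self.to_position = nn.Linear(d_model, 2)  # 2-D position prompt per step

    def forward(self, goal_tokens):
        # Jointly encode learned step queries with the goal-state tokens.
        b = goal_tokens.size(0)
        queries = self.step_queries.unsqueeze(0).expand(b, -1, -1)
        h = self.backbone(torch.cat([queries, goal_tokens], dim=1))
        steps = h[:, :self.n_steps]               # one latent per manual step
        return {"position": self.to_position(steps), "latent": steps}

class ActionExpert(nn.Module):
    """Hypothetical action expert: fuses one manual step (explicit position
    condition + implicit latent guidance) with the current observation."""
    def __init__(self, d_model=256, action_dim=7):
        super().__init__()
        self.pos_embed = nn.Linear(2, d_model)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens, step_position, step_latent):
        # Two condition tokens: the embedded position prompt and the latent.
        cond = torch.stack([self.pos_embed(step_position), step_latent], dim=1)
        h = self.fuse(torch.cat([cond, obs_tokens], dim=1))
        return self.head(h[:, 0])                 # one action per manual step

# ManualCoT-style rollout: plan the manual once, then execute step by step.
planner, actor = PlanningExpert(), ActionExpert()
goal = torch.randn(1, 16, 256)                    # goal-state tokens (assumed)
obs = torch.randn(1, 32, 256)                     # observation tokens (assumed)
manual = planner(goal)
for t in range(manual["latent"].size(1)):
    action = actor(obs, manual["position"][:, t], manual["latent"][:, t])
    print(f"step {t}: action {tuple(action.shape)}")
```

In this sketch the manual is generated once up front for clarity; an interleaved variant, where each executed step updates the observation before the next manual step is consumed, would follow the same interfaces.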