Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert that generates continuous action signals, yet both typically run at a single, unified frequency. As a result, policy performance is constrained by the low inference speed of the large VLM. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce DuoCore-FS, a truly asynchronous Fast-Slow VLA framework that organizes the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The framework has two key features. First, a latent representation buffer bridges the slow and fast pathways: it stores instruction semantics and action-reasoning representations aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still trained jointly end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while generating whole-body action chunks at 30 Hz, roughly three times faster than prior VLA models of comparable size. Real-world whole-body manipulation experiments show improved task success rates and significantly better responsiveness than synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, covering training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
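To make the asynchronous decoupling concrete, the following is a minimal sketch (not the authors' code) of the fast-slow pattern the abstract describes: a slow thread refreshes a shared latent buffer with VLM-derived guidance at its own low rate, while a fast thread reads the most recent latents and emits whole-body action chunks at a fixed control rate. All names here (LatentBuffer, slow_pathway, fast_pathway, and the injected vlm_step/act_step callables) are hypothetical placeholders, not DuoCore-FS APIs.

```python
import threading
import time

class LatentBuffer:
    """Thread-safe holder for the most recent slow-pathway latents."""
    def __init__(self):
        self._lock = threading.Lock()
        self._latents = None

    def write(self, latents):
        with self._lock:
            self._latents = latents

    def read(self):
        with self._lock:
            return self._latents

def slow_pathway(buffer, vlm_step, stop, period_s=0.3):
    # Runs the large VLM at its own low rate (~3 Hz here), writing
    # instruction-semantic / action-reasoning latents into the buffer.
    while not stop.is_set():
        buffer.write(vlm_step())
        time.sleep(period_s)

def fast_pathway(buffer, act_step, stop, rate_hz=30.0):
    # Generates action chunks at a fixed high rate, conditioned on
    # whatever latents are currently available; it never blocks on the VLM.
    period = 1.0 / rate_hz
    while not stop.is_set():
        latents = buffer.read()
        if latents is not None:
            act_step(latents)  # decode and dispatch one action chunk
        time.sleep(period)

if __name__ == "__main__":
    stop = threading.Event()
    buf = LatentBuffer()
    threading.Thread(target=slow_pathway, args=(buf, lambda: "latents", stop), daemon=True).start()
    threading.Thread(target=fast_pathway, args=(buf, print, stop), daemon=True).start()
    time.sleep(1.0)
    stop.set()
```

The design point this sketch illustrates is that the fast loop only ever reads the latest buffered latents, so the 30 Hz control rate is decoupled from, and unaffected by, the VLM's inference latency.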