Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (MLLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA fine-tuning. Our experiments in zero-shot and few-shot settings reveal trade-offs between generalization and data efficiency. By probing the performance limits of each paradigm, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.