Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can draw on a wide variety of tools to augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet how to realize this vision without relying solely on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines, which limit VLMs' ability to discover optimal tool-use patterns, remains an open challenge. Reinforcement Learning could close this gap, but has so far been limited to reasoning with a single visual tool because of the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, equipped with tool-augmented spatial reasoning, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.