Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite this advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features derived from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin on diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.
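The abstract gives no implementation details, but the fusion idea it describes can be illustrated with a minimal, purely hypothetical sketch: encode both views with a shared backbone, treat the left (monocular) features as the semantic branch, use the left-right feature difference as a proxy for the geometric branch, and concatenate the two. The `backbone` function here is a toy stand-in for a vision foundation model, not the paper's actual encoder, and all names and shapes are assumptions.

```python
import numpy as np


def backbone(view: np.ndarray, out_dim: int = 8) -> np.ndarray:
    """Toy stand-in for a frozen vision-foundation-model encoder.

    A fixed random projection of the flattened image; the seed is fixed so
    both views are encoded by the same "model" weights. Hypothetical, for
    illustration only.
    """
    flat = view.reshape(-1).astype(np.float64)
    rng = np.random.default_rng(0)  # same projection for left and right views
    weights = rng.standard_normal((out_dim, flat.size))
    return weights @ flat


def geometric_semantic_features(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Fuse a semantic branch (monocular view) with a geometric branch
    (stereo-view difference), loosely in the spirit of the abstract."""
    f_left = backbone(left)    # semantic-rich features from the monocular view
    f_right = backbone(right)
    geometric = f_left - f_right  # subtle stereo-view differences as a spatial cue
    return np.concatenate([f_left, geometric])  # fused feature for a downstream policy


# Usage: two 4x4 single-channel "views" produce a fused 16-d feature vector.
left = np.zeros((4, 4))
right = np.ones((4, 4))
fused = geometric_semantic_features(left, right)
print(fused.shape)  # (16,)
```

In the real model the geometric branch would come from learned stereo matching rather than a raw feature difference; the difference here just makes the two-branch structure concrete.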