Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the 3D structures essential for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
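To make the auxiliary objective concrete, the sketch below illustrates one plausible instantiation of the idea described above, assuming a PyTorch setting: a VQ-VAE-style encoder tokenizes a depth map into discrete codebook indices, and a depth-expert head cross-attends to the VLA backbone's hidden states to predict those indices, adding a cross-entropy term to the training loss. All module and parameter names here (DepthQuantizer, DepthExpertHead, codebook_size, the loss weighting) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an auxiliary depth-token
# prediction loss: a frozen VQ-VAE-style encoder produces discrete depth
# tokens, and a hypothetical "depth expert" head predicts them from VLA
# backbone features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthQuantizer(nn.Module):
    """Toy VQ-VAE-style encoder: conv encoder + nearest-neighbour codebook lookup."""

    def __init__(self, codebook_size: int = 512, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(                       # 1 x H x W depth -> code_dim x H/8 x W/8
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, code_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)

    @torch.no_grad()
    def tokenize(self, depth: torch.Tensor) -> torch.Tensor:
        """Map a depth map (B, 1, H, W) to discrete token ids (B, H/8 * W/8)."""
        z = self.encoder(depth)                              # (B, D, h, w)
        z = z.permute(0, 2, 3, 1).flatten(1, 2)              # (B, h*w, D)
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, h*w, K)
        return dists.argmin(dim=-1)                          # (B, h*w)


class DepthExpertHead(nn.Module):
    """Hypothetical auxiliary head: VLA hidden states -> logits over depth tokens."""

    def __init__(self, hidden_dim: int, num_tokens: int, codebook_size: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(hidden_dim, codebook_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(hidden.size(0), -1, -1)
        attended, _ = self.attn(q, hidden, hidden)            # cross-attend to backbone features
        return self.proj(attended)                            # (B, num_tokens, codebook_size)


if __name__ == "__main__":
    B, H, W, hidden_dim = 2, 224, 224, 256
    quantizer = DepthQuantizer()
    num_tokens = (H // 8) * (W // 8)
    head = DepthExpertHead(hidden_dim, num_tokens, codebook_size=512)

    depth = torch.rand(B, 1, H, W)                            # stand-in depth maps
    hidden = torch.randn(B, 196, hidden_dim)                  # stand-in VLA hidden states
    target_tokens = quantizer.tokenize(depth)                 # (B, num_tokens)
    logits = head(hidden)                                     # (B, num_tokens, K)

    aux_depth_loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    # total_loss = action_loss + lambda_depth * aux_depth_loss  # weighting is illustrative
    print(aux_depth_loss.item())
```

In practice the quantizer would be pretrained and frozen, so the auxiliary task only shapes the backbone's representations; the weighting between the action loss and the depth-token loss is a design choice not specified here.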