Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of the robot arm, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along the predicted keypoint trajectories. Both predictive modules serve exclusively as training-time supervision through depth-based rendering; at inference, the policy requires only lightweight additional query tokens and performs no 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
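To make the training-time-only supervision pattern concrete, the sketch below (not the authors' code) shows one way to wire lightweight query tokens into a transformer policy: auxiliary heads decode those tokens into multi-step 3D keypoints and a coarse depth map (standing in for depth-based rendering of the predicted 3D Gaussian geometry) during training, and are simply skipped at inference. All module names, dimensions, and loss weights here are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of query-token-based auxiliary supervision, assuming a
# transformer backbone over fused vision/language/proprioception tokens.
import torch
import torch.nn as nn


class GeoPredictStyleVLA(nn.Module):
    def __init__(self, d_model=512, n_queries=8, horizon=6, n_keypoints=4,
                 depth_hw=(32, 32)):
        super().__init__()
        # Lightweight learnable query tokens appended to the observation tokens.
        self.query_tokens = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Continuous-action head (used at train and inference time).
        self.action_head = nn.Linear(d_model, 7)  # e.g. 6-DoF delta pose + gripper
        # Auxiliary predictive heads (training-time supervision only).
        self.keypoint_head = nn.Linear(d_model, horizon * n_keypoints * 3)
        self.depth_head = nn.Linear(d_model, depth_hw[0] * depth_hw[1])
        self.horizon, self.n_keypoints, self.depth_hw = horizon, n_keypoints, depth_hw

    def forward(self, obs_tokens, train_aux=False):
        # obs_tokens: (B, T, d_model) fused observation tokens.
        B, T = obs_tokens.shape[0], obs_tokens.shape[1]
        q = self.query_tokens.unsqueeze(0).expand(B, -1, -1)
        h = self.backbone(torch.cat([obs_tokens, q], dim=1))
        obs_h, q_h = h[:, :T], h[:, T:]
        out = {"action": self.action_head(obs_h[:, -1])}
        if train_aux:  # 3D decoding happens only during training
            pooled = q_h.mean(dim=1)
            out["keypoints"] = self.keypoint_head(pooled).view(
                B, self.horizon, self.n_keypoints, 3)
            out["depth"] = self.depth_head(pooled).view(B, *self.depth_hw)
        return out


def training_loss(model, obs_tokens, action_gt, keypoints_gt, depth_gt,
                  w_kp=1.0, w_depth=0.5):
    # Behavior cloning loss plus the two auxiliary predictive losses.
    out = model(obs_tokens, train_aux=True)
    loss = nn.functional.mse_loss(out["action"], action_gt)
    loss = loss + w_kp * nn.functional.mse_loss(out["keypoints"], keypoints_gt)
    loss = loss + w_depth * nn.functional.l1_loss(out["depth"], depth_gt)
    return loss
```

At inference, calling `model(obs_tokens)` with the default `train_aux=False` returns only the action, so the extra cost over a plain VLA policy is just the handful of query tokens passing through the backbone, mirroring the abstract's claim that no 3D decoding is invoked at test time.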