When performing 3D manipulation tasks, robots must plan actions from the observations of multiple fixed cameras. This multi-camera setup introduces substantial redundant and irrelevant information, which increases computational cost and forces the model to spend extra training time extracting the crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, which leverages the knowledge embedded in foundation models to imagine a virtual, task-adaptive view from the constructed 3D point cloud, efficiently capturing the necessary information while mitigating occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on the RLBench simulation benchmark and in real-world evaluations demonstrate the effectiveness of our method: it surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training and a 1.54x speedup in inference. More results can be found on our project website at https://verm-ral.github.io .
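For context, the sketch below illustrates only the generic geometric operation the abstract refers to: fusing fixed-camera RGB-D observations into a world-frame point cloud and re-projecting it into a chosen virtual viewpoint with a z-buffer. It is an illustrative, assumption-laden example, not the VERM implementation; the function names, camera intrinsics, and the virtual camera pose are placeholders, and the task-adaptive selection of that viewpoint (the part VERM derives from foundation-model knowledge) is not shown.

```python
# Illustrative sketch only -- NOT the authors' VERM pipeline.
# Fuses fixed-camera RGB-D frames into a world-frame point cloud, then
# re-renders the cloud from a "virtual" camera via pinhole projection
# with a z-buffer so nearer points occlude farther ones.
import numpy as np


def backproject(depth, rgb, K, T_world_cam):
    """Lift one RGB-D frame (depth in meters) to colored world-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors


def render_virtual_view(points, colors, K, T_world_virt, hw=(128, 128)):
    """Project world points into a virtual camera; keep the nearest point per pixel."""
    h, w = hw
    T_virt_world = np.linalg.inv(T_world_virt)
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (T_virt_world @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, cols = pts_cam[in_front], colors[in_front]
    u = np.round(K[0, 0] * pts_cam[:, 0] / pts_cam[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / pts_cam[:, 2] + K[1, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], pts_cam[ok, 2], cols[ok]
    image = np.zeros((h, w, 3), dtype=cols.dtype)
    zbuf = np.full((h, w), np.inf)
    # Paint far-to-near so nearer points overwrite farther ones at each pixel.
    for i in np.argsort(-z):
        image[v[i], u[i]] = cols[i]
        zbuf[v[i], u[i]] = z[i]
    return image, zbuf
```

In this sketch the virtual camera pose `T_world_virt` is supplied by hand; in the method described above, the viewpoint is instead chosen task-adaptively, which is what allows a single rendered view to replace the redundant multi-camera input.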