Relying on multi-modal observations, embodied robots (e.g., humanoid robots) can perform a wide range of manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face two long-standing challenges, namely 3D scene representation and human-level task learning, when adapting to a series of new tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots, which continually learns observation knowledge of novel 3D scene semantics and robot manipulation skills from skill-shared and skill-specific attributes, respectively. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from the skill-shared attributes, thereby addressing the neglect of 3D scene representation. Meanwhile, we develop a skill-specific evolving planner that decouples manipulation knowledge, continually embedding novel skill-specific knowledge in a latent, low-rank space in a human-like manner. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the significant performance of our method.
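To make the "low-rank space" idea concrete, below is a minimal PyTorch sketch of one plausible reading: a frozen skill-shared linear layer augmented with one low-rank (LoRA-style) adapter per skill, so new skills add fresh factors without overwriting old ones. This is an illustration under stated assumptions, not the paper's actual planner; the class name `SkillLoRALinear`, the rank, and the skill keys are all hypothetical.

```python
# Hypothetical sketch: per-skill low-rank adapters over a frozen shared layer.
import torch
import torch.nn as nn

class SkillLoRALinear(nn.Module):
    """A frozen skill-shared linear layer plus one low-rank adapter per skill."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)
        self.shared.weight.requires_grad_(False)  # skill-shared weights stay fixed
        self.shared.bias.requires_grad_(False)
        self.rank = rank
        self.adapters = nn.ModuleDict()  # one low-rank factor pair (A, B) per skill

    def add_skill(self, skill: str) -> None:
        # New skills get fresh low-rank factors; existing adapters are untouched,
        # so previously learned skills are not overwritten (no catastrophic forgetting
        # at the parameter level). B starts at zero, so the adapter is initially a no-op.
        self.adapters[skill] = nn.ParameterDict({
            "A": nn.Parameter(torch.randn(self.rank, self.shared.in_features) * 0.01),
            "B": nn.Parameter(torch.zeros(self.shared.out_features, self.rank)),
        })

    def forward(self, x: torch.Tensor, skill: str) -> torch.Tensor:
        base = self.shared(x)                       # skill-shared computation
        p = self.adapters[skill]
        return base + x @ p["A"].T @ p["B"].T       # skill-specific low-rank update

# Usage: register a new skill, then route inputs through its adapter.
layer = SkillLoRALinear(256, 256)
layer.add_skill("open_drawer")
out = layer(torch.randn(4, 256), skill="open_drawer")
```

Keeping the shared weights frozen and giving each skill its own additive low-rank update is one standard way to decouple skill-shared from skill-specific knowledge in continual learning; how NBAgent realizes this decoupling is detailed in the method section.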