Our goal is for a robot to execute a previously unseen task based on a single video demonstration of the task. The success of our approach relies on the principle of transferring knowledge from seen tasks to unseen ones with similar semantics. More importantly, we hypothesize that to successfully execute a complex task from a single video demonstration, it is necessary to explicitly incorporate compositionality into the model. To test our hypothesis, we propose Neural Task Graph (NTG) Networks, which use the task graph as an intermediate representation to modularize the representations of both the video demonstration and the derived policy. We show that this formulation achieves strong inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. We further show that the same principle is applicable to real-world videos, where NTG improves the data efficiency of few-shot activity understanding on the Breakfast Dataset.
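To make the role of the task graph as an intermediate representation concrete, the following is a minimal, hypothetical sketch in Python. All names here (`TaskGraph`, `graph_from_demo`, the toy block-stacking states) are illustrative assumptions, not the paper's actual implementation: nodes stand for task states, edges for actions, a stub "demonstration interpreter" builds the graph from one demonstration, and a trivial graph-conditioned policy reads actions back off the graph.

```python
# Hypothetical sketch of a task-graph intermediate representation.
# Not the NTG architecture itself; only meant to illustrate the idea of
# decoupling "parse the demonstration into a graph" from "act from the graph".
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaskGraph:
    """Nodes are task states; edges are actions transitioning between them."""
    nodes: List[str] = field(default_factory=list)
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)  # (u, v) -> action

    def add_transition(self, state: str, action: str, next_state: str) -> None:
        for s in (state, next_state):
            if s not in self.nodes:
                self.nodes.append(s)
        self.edges[(state, next_state)] = action

    def next_action(self, state: str) -> str:
        """A trivial graph-conditioned 'policy': return any outgoing action."""
        for (u, v), action in self.edges.items():
            if u == state:
                return action
        return "done"


def graph_from_demo(demo: List[Tuple[str, str, str]]) -> TaskGraph:
    """Stand-in for the demonstration interpreter: build a graph from
    (state, action, next_state) triplets parsed from a single video."""
    graph = TaskGraph()
    for state, action, next_state in demo:
        graph.add_transition(state, action, next_state)
    return graph


# Usage: a toy two-step block-stacking demonstration.
demo = [("empty", "place(A)", "A"), ("A", "stack(B,A)", "BA")]
graph = graph_from_demo(demo)
print(graph.next_action("A"))  # -> "stack(B,A)"
```

In the actual NTG formulation the graph generation and the graph-conditioned policy are learned neural modules rather than the hand-written stubs above; the sketch only shows why such an explicit, compositional intermediate structure can transfer across tasks that share semantics.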