Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans face an unfamiliar task, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serve as additional inputs that enhance the robot model's learning and generalization capabilities. Our method features a dual-component design: a video retriever that taps into the external video bank to fetch task-relevant videos based on the task specification, and a policy generator that integrates this retrieved knowledge into the learning cycle. This approach enables robots to craft adaptive responses to varied scenarios and to generalize to tasks beyond those in the training data. Through rigorous testing in multiple simulated and real-world settings, our system demonstrates a marked improvement in performance over conventional robotic systems, showcasing a significant breakthrough in the field of robotics.
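The retrieve-then-condition pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `VideoRetriever` and `policy_input` names are hypothetical, cosine similarity over precomputed embeddings stands in for whatever learned video/task encoder the real system uses, and the retrieved mid-level cues (affordance masks, hand trajectories) are reduced here to flat feature vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class VideoRetriever:
    """Illustrative retriever: fetches the most task-relevant videos from
    an external bank. Each video is represented by a precomputed feature
    vector; the actual system would embed videos and task specifications
    with a learned encoder (assumption, not specified in the abstract)."""

    def __init__(self, bank_embeddings, video_ids):
        self.bank = bank_embeddings  # shape (N, d), one row per stored video
        self.ids = video_ids

    def retrieve(self, task_embedding, k=1):
        scores = [cosine_sim(task_embedding, v) for v in self.bank]
        top = np.argsort(scores)[::-1][:k]  # indices of k highest scores
        return [self.ids[i] for i in top]

def policy_input(observation, retrieved_features):
    """Policy-generator stub: conditions the policy on retrieved mid-level
    cues (e.g. flattened affordance masks or hand trajectories) by simply
    concatenating them with the robot's observation."""
    return np.concatenate([observation, retrieved_features])

# Toy video bank: 2-D embeddings standing in for learned video features.
bank = np.array([[1.0, 0.0],    # "pour water"
                 [0.0, 1.0],    # "open drawer"
                 [0.7, 0.7]])   # "wipe table"
retriever = VideoRetriever(bank, ["pour water", "open drawer", "wipe table"])
best = retriever.retrieve(np.array([0.9, 0.2]))[0]  # most similar task
```

A real system would replace the concatenation with a learned fusion module, but the sketch captures the two-stage structure: retrieval by similarity to the task specification, then conditioning the policy on what was retrieved.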