LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To complete tasks successfully, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between graph construction and task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or by discovering new entities. This is achieved by processing the agent's egocentric camera view with a Vision Language Model (VLM). We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. The project page is available at https://lookplangraph.github.io.
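
To make the graph update idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy scene graph in which static assets are plain names and each object carries a believed location plus a flag saying whether that location has been observed. The class `SceneGraph`, its methods `add_prior` and `update_from_view`, and the `seen` set (standing in for whatever a VLM reports from the egocentric view) are all hypothetical names introduced here. Dropping a prior when the object is not seen at its expected asset is also an assumption of this sketch; the abstract only states that priors are verified and new entities are discovered.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: static assets plus objects whose locations start as priors."""

    # Static assets (hypothetical examples: "fridge", "kitchen_table").
    assets: set[str] = field(default_factory=set)
    # Object name -> (asset it is believed to be at, True if that was observed).
    objects: dict[str, tuple[str, bool]] = field(default_factory=dict)

    def add_prior(self, obj: str, asset: str) -> None:
        """Record an unverified belief that `obj` is located at `asset`."""
        if asset not in self.assets:
            raise KeyError(f"unknown asset: {asset}")
        self.objects[obj] = (asset, False)

    def update_from_view(self, asset: str, seen: set[str]) -> None:
        """Merge the objects a (stubbed) VLM reported seeing at `asset`."""
        if asset not in self.assets:
            raise KeyError(f"unknown asset: {asset}")
        # Assumption of this sketch: an object expected here but not seen
        # is treated as a stale prior and removed from the graph.
        for obj, (location, _) in list(self.objects.items()):
            if location == asset and obj not in seen:
                del self.objects[obj]
        # Seen objects either confirm a prior, move to this asset, or are new.
        for obj in seen:
            self.objects[obj] = (asset, True)
```

For example, if the graph holds priors placing `apple` on `kitchen_table` and `milk` in `fridge`, and the view at `fridge` reports both `milk` and `apple`, then after `update_from_view("fridge", {"milk", "apple"})` both objects are recorded at `fridge` as observed. In a real system the `seen` set would come from a VLM query over the camera image, which is deliberately left out here.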
