From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, since the training datasets of Vision Language Models (VLMs) do not include low-level robot motion information, video understanding that incorporates trajectory information remains a significant challenge. In this study, we assess two capabilities of VLMs through a video captioning task that uses low-level robot motion information: (1) automatic captioning of robot tasks and (2) segmentation of a series of tasks. Both capabilities are expected to improve the efficiency of robot imitation learning by linking language and motion, and they also serve as a measure of the foundation model's performance. The proposed method generates multiple "scene" captions from image captions and trajectory data recorded during robot tasks, and then produces the full task caption by summarizing these scene captions. In addition, the method segments subtasks by comparing the similarity between the text embeddings of the image captions. In both captioning tasks, the proposed method aims to improve performance by providing the robot's motion data (joint and end-effector states) as input to the VLM. Simulator experiments were conducted to validate the effectiveness of the proposed method.
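To make the segmentation step concrete, the sketch below shows one way the comparison of text embeddings between consecutive scene captions could be implemented. The function name, the fixed similarity threshold, and the use of plain cosine similarity over consecutive captions are assumptions for illustration; they are not details taken from the paper, which only states that subtask boundaries are derived from embedding similarity.

```python
import numpy as np

def segment_subtasks(caption_embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Return the indices of scene captions assumed to start a new subtask.

    caption_embeddings: (N, D) array with one text embedding per scene caption,
        produced by any text encoder (the choice of encoder is left open here).
    threshold: cosine-similarity value below which two consecutive captions are
        treated as belonging to different subtasks (an assumed hyperparameter).
    """
    # Normalize rows so the dot product of two rows equals their cosine similarity.
    norms = np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    normed = caption_embeddings / norms

    boundaries = [0]  # the first scene always opens a subtask
    for i in range(1, len(normed)):
        similarity = float(normed[i - 1] @ normed[i])
        if similarity < threshold:
            # A drop in similarity is read as a transition to a new subtask.
            boundaries.append(i)
    return boundaries
```

Given embeddings of the per-scene captions, calling `segment_subtasks(embeddings)` would yield candidate subtask start indices; in practice the threshold (or an adaptive variant of it) would need to be tuned against the segmentation quality observed in the simulator experiments.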