Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive ``Metaloop'' training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This alternation automatically identifies weaknesses and allocates training resources to them, and is specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalized as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottlenecks and enables the community to build versatile embodied agents efficiently.
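The Metaloop described above can be pictured as a simple alternating training loop. The following is a minimal, hypothetical Python sketch of that structure only: the helper callables (evaluate_skills, select_examples, sft_update, collect_rollouts, rl_update) and the weakness_threshold parameter are illustrative assumptions, not the released Pelican-VL 1.0 or DPPO API.

```python
from typing import Any, Callable, Dict, List

def metaloop(
    model: Any,
    data_pool: Any,
    evaluate_skills: Callable[[Any, Any], Dict[str, float]],   # hypothetical helper
    select_examples: Callable[[Any, List[str]], Any],          # hypothetical helper
    sft_update: Callable[[Any, Any], Any],                     # hypothetical helper
    collect_rollouts: Callable[[Any, List[str]], Any],         # hypothetical helper
    rl_update: Callable[[Any, Any], Any],                      # hypothetical helper
    rounds: int = 3,
    weakness_threshold: float = 0.5,                           # assumed cutoff for "weak" skills
) -> Any:
    """Alternate supervised fine-tuning (competence expansion) and RL
    (skill refinement), spending each round's budget on the weakest skills."""
    for _ in range(rounds):
        # Metacognitive step: probe current competence per skill.
        scores = evaluate_skills(model, data_pool)
        weak = [skill for skill, s in scores.items() if s < weakness_threshold]
        if not weak:
            break  # no remaining weaknesses below the threshold

        # Competence expansion: SFT on examples targeting the weak skills.
        model = sft_update(model, select_examples(data_pool, weak))

        # Skill refinement: RL (e.g. preference-based policy optimization)
        # on rollouts drawn from the same weak skills.
        model = rl_update(model, collect_rollouts(model, weak))
    return model
```

In this reading, the metacognitive evaluation step is what drives targeted resource allocation: each round, both the SFT data selection and the RL rollout budget are concentrated on the skills the model currently handles worst.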