QDepth-VLA：量化深度预测作为视觉-语言-动作模型的辅助监督 (QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models) - 专知论文

会员服务 ·

0

监督 · 空间感知 · 细粒度 · 粒度 · 三维结构 ·

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

翻译：QDepth-VLA：量化深度预测作为视觉-语言-动作模型的辅助监督

Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li,Zhengtao Zhang,Dongbin Zhao

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

翻译：空间感知与推理对于视觉-语言-动作（VLA）模型完成细粒度操控任务至关重要。然而，现有方法通常缺乏对精确控制所必需的三维结构的理解与推理能力。为应对这一局限，本文提出QDepth-VLA，一种通过辅助深度预测任务增强VLA模型的通用框架。我们设计了一个专用的深度专家模块，用于预测由VQ-VAE编码器获得的深度图的量化潜在标记，使模型能够学习捕获关键几何线索的深度感知表征。在仿真基准测试和真实世界任务上的实验结果表明，QDepth-VLA在操控任务中展现出强大的空间推理能力和具有竞争力的性能。

0

相关内容

【CVPR2024】ViewDiff: 3D一致的图像生成与文本到图像模型

【CVPR2024】ViewDiff: 3D一致的图像生成与文本到图像模型

专知会员服务

30+阅读 · 2024年3月10日

【NeurIPS2022】SparCL:边缘稀疏持续学习

【NeurIPS2022】SparCL:边缘稀疏持续学习

专知会员服务

24+阅读 · 2022年9月22日

UTC: 用于视觉对话的任务间对比学习的统一Transformer

UTC: 用于视觉对话的任务间对比学习的统一Transformer

专知会员服务

14+阅读 · 2022年5月4日

【ACL2020-CMU-Google】MobileBERT:用于资源受限设备的任务无关“瘦版”BERT

【ACL2020-CMU-Google】MobileBERT:用于资源受限设备的任务无关“瘦版”BERT

专知会员服务

13+阅读 · 2020年4月9日

【CVPR2020-UBC】改进小样本学习视觉分类，Few-Shot Visual Classification

【CVPR2020-UBC】改进小样本学习视觉分类，Few-Shot Visual Classification

专知会员服务

68+阅读 · 2020年2月25日

AAAI 2022 | ProtGNN：自解释图神经网络

AAAI 2022 | ProtGNN：自解释图神经网络

专知

10+阅读 · 2022年2月28日

Python图像处理，366页pdf，Image Operators Image Processing in Python

Python图像处理，366页pdf，Image Operators Image Processing in Python

专知

15+阅读 · 2020年7月23日

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

专知

13+阅读 · 2020年4月1日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

【NeurIPS2019】图变换网络：Graph Transformer Network

【NeurIPS2019】图变换网络：Graph Transformer Network

专知

245+阅读 · 2019年11月18日

不确定分数阶非线性系统Mittag-Leffler自适应控制

国家自然科学基金

1+阅读 · 2016年12月31日

基于多元互信息和快速稀疏多核学习的高光谱遥感影像地物分类

国家自然科学基金

0+阅读 · 2015年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于稀疏表达理论和RGBD图像的人脸表情识别

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Arxiv

0+阅读 · 12月24日

LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Arxiv

0+阅读 · 12月23日

EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Arxiv

0+阅读 · 12月21日

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Arxiv

0+阅读 · 12月21日

Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Arxiv

0+阅读 · 12月19日

VIP会员

文章信息

相关主题

相关VIP内容

【CVPR2024】ViewDiff: 3D一致的图像生成与文本到图像模型

【CVPR2024】ViewDiff: 3D一致的图像生成与文本到图像模型

专知会员服务

30+阅读 · 2024年3月10日

【NeurIPS2022】SparCL:边缘稀疏持续学习

【NeurIPS2022】SparCL:边缘稀疏持续学习

专知会员服务

24+阅读 · 2022年9月22日

UTC: 用于视觉对话的任务间对比学习的统一Transformer

UTC: 用于视觉对话的任务间对比学习的统一Transformer

专知会员服务

14+阅读 · 2022年5月4日

【ACL2020-CMU-Google】MobileBERT:用于资源受限设备的任务无关“瘦版”BERT

【ACL2020-CMU-Google】MobileBERT:用于资源受限设备的任务无关“瘦版”BERT

专知会员服务

13+阅读 · 2020年4月9日

【CVPR2020-UBC】改进小样本学习视觉分类，Few-Shot Visual Classification

【CVPR2020-UBC】改进小样本学习视觉分类，Few-Shot Visual Classification

专知会员服务

68+阅读 · 2020年2月25日

热门VIP内容

开通专知VIP会员享更多权益服务

【书籍】从零开始构建文本生成图像生成器：基于 Transformers 与扩散模型

人工智能与未来指挥

【伯克利博士论文】将大语言模型绑定至虚拟人格：实现人类行为模拟

稀疏自编码器综述：解释大语言模型的内部机制

相关资讯

AAAI 2022 | ProtGNN：自解释图神经网络

AAAI 2022 | ProtGNN：自解释图神经网络

专知

10+阅读 · 2022年2月28日

Python图像处理，366页pdf，Image Operators Image Processing in Python

Python图像处理，366页pdf，Image Operators Image Processing in Python

专知

15+阅读 · 2020年7月23日

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

专知

13+阅读 · 2020年4月1日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

【NeurIPS2019】图变换网络：Graph Transformer Network

【NeurIPS2019】图变换网络：Graph Transformer Network

专知

245+阅读 · 2019年11月18日

相关论文

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Arxiv

0+阅读 · 12月24日

LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Arxiv

0+阅读 · 12月23日

EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Arxiv

0+阅读 · 12月21日

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Arxiv

0+阅读 · 12月21日

Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Arxiv

0+阅读 · 12月19日

相关基金

不确定分数阶非线性系统Mittag-Leffler自适应控制

国家自然科学基金

1+阅读 · 2016年12月31日

基于多元互信息和快速稀疏多核学习的高光谱遥感影像地物分类

国家自然科学基金

0+阅读 · 2015年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于稀疏表达理论和RGBD图像的人脸表情识别

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

微信扫码咨询专知VIP会员