Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning that supports both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path-planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling tests of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and data generation pipeline.
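To make concrete the kind of derivation such a pipeline performs, below is a minimal sketch of turning two camera-frame 6D poses into a relative-depth question-answer pair. It assumes 4x4 object-to-camera transforms with +z along the optical axis; the function `relative_depth_qa`, the object names, and the QA template are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def relative_depth_qa(name_a: str, pose_a: np.ndarray,
                      name_b: str, pose_b: np.ndarray) -> tuple[str, str]:
    """Emit a question-answer pair about which of two objects is closer
    to the camera, given 4x4 camera-frame poses (+z = optical axis)."""
    # The z component of each translation is the distance along the
    # optical axis; the smaller value is the closer object.
    z_a, z_b = pose_a[2, 3], pose_b[2, 3]
    closer = name_a if z_a < z_b else name_b
    question = (f"Which object is closer to the camera: "
                f"the {name_a} or the {name_b}?")
    answer = f"The {closer} is closer to the camera."
    return question, answer

# Hypothetical example: two objects at different depths from the camera.
mug = np.eye(4); mug[:3, 3] = [0.10, 0.02, 0.40]
box = np.eye(4); box[:3, 3] = [-0.05, 0.01, 0.70]
print(relative_depth_qa("mug", mug, "box", box))
```

The same pose inputs could analogously drive templates for left/right relationships or grasp-pose annotations; the sketch only illustrates the depth case.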