Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions of the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetry between images and text. To enhance the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named $\text{Stitch and Tell}$ (abbreviated as SiTe), which injects structured spatial supervision into training data. SiTe constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures (LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B, and HALVA-7B), two training datasets, and eight benchmarks. Experiments show that SiTe improves performance on spatial understanding tasks such as $\text{MME}_{\text{Position}}$ (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks, including COCO-QA (+1.02%) and MMBench (+4.76%). Our findings suggest that explicitly injecting spatially-aware structure into training data is an effective way to mitigate spatial hallucinations and improve spatial understanding while preserving general vision-language capabilities.
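To make the data-construction idea concrete, here is a minimal sketch (not the authors' released code) of the SiTe-style pipeline described above: stitch two images along a chosen spatial axis and emit a template caption that encodes their relative positions. It assumes Pillow is available; the function name, image paths, and caption templates are hypothetical placeholders.

```python
# Minimal sketch of SiTe-style stitched image-text pair construction.
# Assumptions: Pillow (PIL) is installed; labels describing each source
# image are already available (e.g., from the original captions).
from PIL import Image


def stitch_and_caption(path_a, path_b, label_a, label_b, axis="horizontal"):
    """Stitch two images along an axis and return (stitched_image, caption)."""
    img_a, img_b = Image.open(path_a), Image.open(path_b)

    if axis == "horizontal":
        # Place image A on the left and image B on the right.
        h = max(img_a.height, img_b.height)
        canvas = Image.new("RGB", (img_a.width + img_b.width, h))
        canvas.paste(img_a, (0, 0))
        canvas.paste(img_b, (img_a.width, 0))
        caption = f"The {label_a} is on the left and the {label_b} is on the right."
    else:
        # Vertical stitching: image A on top, image B below.
        w = max(img_a.width, img_b.width)
        canvas = Image.new("RGB", (w, img_a.height + img_b.height))
        canvas.paste(img_a, (0, 0))
        canvas.paste(img_b, (0, img_a.height))
        caption = f"The {label_a} is above the {label_b}."

    return canvas, caption


# Hypothetical usage:
# stitched, cap = stitch_and_caption("cat.jpg", "dog.jpg", "cat", "dog")
# stitched.save("stitched.jpg"); print(cap)
```

Because the caption (or a question-answer pair derived from the same template) is determined entirely by the stitching layout, the spatial supervision comes for free, without annotators or an external captioning model.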