Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well mainly on reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that leverages panoramic imagery to massively generate multiple-choice VQA, enabling context-rich, spatially-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, despite a large gap from human performance in OOV VQA answer selection, multiple MLLMs empowered by OpenView consistently boost their performance, with average accuracy uplifted from 48.6% to 64.1%. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.