The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning needed for real-world tasks such as analyzing documents with dense charts and diagrams or navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image regions through multi-step reasoning. The problems remain highly challenging even for frontier systems such as OpenAI o3, which obtains only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher). For the vSearcher, we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, beyond simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (serving as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset are available at https://github.com/m-Just/InSight-o3.
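To make the division of labor concrete, below is a minimal sketch of how a plug-and-play vSearcher could be composed with a vReasoner. All class names, method signatures, and the bounded search loop are illustrative assumptions for exposition, not the actual InSight-o3 API; see the repository linked above for the real implementation.

```python
# Illustrative sketch only: the interfaces below are assumptions, not the
# InSight-o3 codebase. A real vSearcher would be an RL-trained multimodal
# LLM, and a real vReasoner a frontier multimodal model.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Region:
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) pixel coordinates
    description: str                  # the query this region was matched to

class VSearcher:
    """Stub for a generalized visual search agent: maps a free-form query
    (relational, fuzzy, or conceptual) to regions of the image."""
    def search(self, image_path: str, query: str) -> List[Region]:
        # Placeholder behavior; a real agent would localize the query.
        return [Region(bbox=(0, 0, 64, 64), description=query)]

class VReasoner:
    """Stub for a visual reasoning agent that delegates localization to a
    plug-and-play vSearcher whenever it needs finer visual detail."""
    def __init__(self, searcher: VSearcher, max_rounds: int = 4):
        self.searcher = searcher
        self.max_rounds = max_rounds  # bound on search rounds (assumption)

    def answer(self, image_path: str, question: str) -> str:
        evidence: List[Region] = []
        for _ in range(self.max_rounds):
            query = self._next_query(question, evidence)
            if query is None:  # confident enough to answer directly
                break
            evidence.extend(self.searcher.search(image_path, query))
        return f"answer grounded in {len(evidence)} retrieved region(s)"

    def _next_query(self, question: str, evidence: List[Region]) -> Optional[str]:
        # A real reasoner would decide here whether more visual detail is
        # needed; this stub issues one search and then stops.
        return question if not evidence else None

if __name__ == "__main__":
    reasoner = VReasoner(VSearcher())
    print(reasoner.answer("chart.png", "value of the leftmost bar in panel B"))
```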