Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we introduce a task in which an MLLM must select a target image from a candidate pool without task-specific priors, actively acquiring missing evidence and iteratively refining its decisions under incomplete information. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 advanced MLLMs and find that their performance on active reasoning lags far behind their performance in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.