Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues within dense layouts and in performing multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts and generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, recovering correct answers while keeping computational cost low. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict model. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency relative to large proprietary models or training-based pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
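To make the draft-verdict flow concrete, the following is a minimal Python sketch of the pipeline described above, not the released implementation (see the repository linked above for the actual code). The helper callables (`draft_experts`, `verdict_model`, `extract_answer`) and the answer-agreement threshold used for consensus expert selection are illustrative assumptions.

```python
# Minimal sketch of the Speculative Verdict (SV) pipeline described above.
# The helper names (draft_experts, verdict_model, extract_answer) are
# hypothetical placeholders, not the paper's actual API; consensus is
# approximated here by simple answer agreement across draft paths.
from collections import Counter
from typing import Callable, List


def speculative_verdict(
    image: bytes,
    question: str,
    draft_experts: List[Callable[[bytes, str], str]],      # small VLMs -> reasoning path
    verdict_model: Callable[[bytes, str, List[str]], str],  # strong VLM -> final answer
    extract_answer: Callable[[str], str],                   # pulls a short answer from a path
    min_agreement: int = 2,                                 # assumed agreement threshold
) -> str:
    # Draft stage: each lightweight expert produces a reasoning path
    # containing its localization cues and intermediate conclusions.
    paths = [expert(image, question) for expert in draft_experts]

    # Consensus expert selection: keep only paths whose extracted answers
    # agree with at least `min_agreement` drafts (including themselves).
    counts = Counter(extract_answer(p) for p in paths)
    selected = [p for p in paths if counts[extract_answer(p)] >= min_agreement]
    if not selected:  # fall back to all paths if no consensus emerges
        selected = paths

    # Verdict stage: the strong VLM reads the image, the question, and the
    # selected reasoning paths, then synthesizes the final answer.
    return verdict_model(image, question, selected)
```

In this sketch, agreement is measured on short extracted answers only; the actual selection criterion in the released code may differ.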