Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and in performing multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict model. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training-based pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
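To make the draft-then-verdict flow concrete, here is a minimal sketch of the pipeline in Python. It assumes a simple majority-vote form of consensus expert selection; all names (`generate_reasoning_path`, `synthesize`, the expert and verdict objects) are hypothetical placeholders rather than the released API, which lives in the repository linked above.

```python
# Hypothetical sketch of the Speculative Verdict (SV) pipeline.
# Function and object names are illustrative assumptions, not the actual repo API.
from collections import Counter


def speculative_verdict(image, question, draft_experts, verdict_model, k=3):
    # Draft stage: each lightweight VLM produces a reasoning path
    # (localization cues plus a candidate answer).
    paths = [expert.generate_reasoning_path(image, question)
             for expert in draft_experts]

    # Consensus expert selection (assumed here as majority vote):
    # keep only paths whose candidate answer matches the most common one.
    answers = [p.answer for p in paths]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    selected = [p for p in paths if p.answer == majority_answer][:k]

    # Verdict stage: a strong VLM reads the selected reasoning paths
    # together with the image and question, and synthesizes the final answer.
    return verdict_model.synthesize(image, question, selected)
```

Because only the selected high-agreement paths reach the verdict model, the expensive strong VLM is called once per query, which is where the cost savings over running a large proprietary model end-to-end come from.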