FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao

From arXiv; accepted to the AAAI-26 Main Track.

We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
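The percentages above can be turned back into approximate problem counts as a quick sanity check. The sketch below is illustrative only: the constants are the figures quoted in the abstract, the helper `implied_count` is a hypothetical name, and the resulting counts are rounded estimates rather than numbers reported by the authors.

```python
# Headline statistics as quoted in the abstract.
TOTAL_PROBLEMS = 1200
SCENARIO_SHARE = 0.579     # share of problems with an implicit financial scenario
CROSS_PAGE_SHARE = 0.650   # share of problems requiring cross-page evidence
EXTRACTION_STEPS = 5.3     # mean extraction steps per problem
CALCULATION_STEPS = 5.7    # mean calculation steps per problem


def implied_count(share: float, total: int) -> int:
    """Hypothetical helper: round a reported share back to an item count."""
    return round(share * total)


# Approximate counts implied by the reported percentages (estimates only).
scenario_problems = implied_count(SCENARIO_SHARE, TOTAL_PROBLEMS)
cross_page_problems = implied_count(CROSS_PAGE_SHARE, TOTAL_PROBLEMS)

# The two step averages should sum to the reported 11-step mean.
mean_steps = round(EXTRACTION_STEPS + CALCULATION_STEPS, 1)

print(f"scenario-based problems: ~{scenario_problems}")
print(f"cross-page problems:     ~{cross_page_problems}")
print(f"mean reasoning steps:    {mean_steps}")
```

Under these assumptions, roughly 695 problems involve an implicit scenario and roughly 780 require evidence from more than one page, and the extraction and calculation averages are consistent with the stated 11-step mean.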
