Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of the complex chart scenarios and computation-intensive reasoning tasks that are prevalent in real-world applications. To address these limitations, this study proposes an automated, multi-stage, code-driven pipeline for systematically generating visual reasoning datasets. The pipeline uses retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning code that simulates real data distributions; this code drives both chart rendering and the question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
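To make the code-driven idea concrete, the following is a minimal sketch of how a single piece of generated code could both simulate the data behind a chart and compute the answer to a question about it. It is not the pipeline's actual implementation: the function name `generate_chart_sample`, the revenue scenario, the trend and noise parameters, and the dictionary layout are all hypothetical choices made for illustration, and the rendering step is reduced to a plain chart specification.

```python
import random
import statistics

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]


def generate_chart_sample(seed: int = 0) -> dict:
    """Simulate chart data and derive Q&A answers from the same values.

    Hypothetical illustration: the scenario and all parameters are assumed.
    """
    rng = random.Random(seed)

    # Simulate a plausible distribution: linear upward trend plus Gaussian noise.
    revenue = [round(100 + 12 * i + rng.gauss(0, 8), 1) for i in range(len(MONTHS))]

    # Chart specification that a rendering backend could consume.
    chart = {
        "type": "line",
        "title": "Monthly revenue (simulated)",
        "x": MONTHS,
        "y": revenue,
    }

    # Multi-step computation: month-over-month deltas, then the largest one.
    deltas = [later - earlier for earlier, later in zip(revenue, revenue[1:])]
    best = max(range(len(deltas)), key=lambda i: deltas[i])

    qa = [
        {
            "question": "Which month shows the largest month-over-month increase?",
            "answer": MONTHS[best + 1],
        },
        {
            "question": "What is the mean monthly revenue?",
            "answer": round(statistics.mean(revenue), 2),
        },
    ]
    return {"chart": chart, "qa": qa}


if __name__ == "__main__":
    sample = generate_chart_sample(seed=42)
    print(sample["chart"]["y"])
    for pair in sample["qa"]:
        print(pair["question"], "->", pair["answer"])
```

Because the answers are computed from the same values that define the chart, the question, the answer, and the rendered figure stay consistent by construction, which is the property a code-driven approach is meant to provide.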