Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of DeepSeek-R1, researchers have extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address these gaps, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM), which quantifies sample hardness through systematic image degradation, and Cross-Modality Attention Balance (CMAB), which assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and we evaluate it across six benchmark datasets. Experiments demonstrate that GRPO applied to difficulty-stratified samples consistently outperforms conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
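As a rough illustration only (not the paper's actual implementation), the sketch below shows one way a PISM-style difficulty score could be computed: image patches are progressively greyed out, and a sample is rated harder the earlier the model's answer breaks down under masking. The helper names (`mask_image`, `pism_difficulty`, `answer_fn`), the patch size, and the masking schedule are hypothetical placeholders for whatever MLLM inference interface is used.

```python
import numpy as np
from PIL import Image


def mask_image(image: Image.Image, mask_ratio: float, patch: int = 32, seed: int = 0) -> Image.Image:
    """Grey out a random fraction of fixed-size patches to degrade image semantics."""
    arr = np.array(image).copy()
    h, w = arr.shape[:2]
    coords = [(y, x) for y in range(0, h, patch) for x in range(0, w, patch)]
    rng = np.random.default_rng(seed)
    chosen = rng.permutation(len(coords))[: int(len(coords) * mask_ratio)]
    for i in chosen:
        y, x = coords[i]
        arr[y:y + patch, x:x + patch] = 127  # neutral grey fill
    return Image.fromarray(arr)


def pism_difficulty(image, question, gold_answer, answer_fn,
                    mask_levels=(0.0, 0.25, 0.5, 0.75)) -> float:
    """Hypothetical PISM-style score: find the first masking level at which the
    model's answer becomes wrong; samples that fail earlier get higher difficulty."""
    for level in mask_levels:
        pred = answer_fn(mask_image(image, level), question)  # answer_fn wraps the MLLM call
        if pred != gold_answer:
            return 1.0 - level  # fails with little degradation -> high difficulty
    return 0.0  # robust even under heavy masking -> easy sample
```

Such a score could then be used to stratify a training pool into difficulty tiers before applying GRPO, which is the spirit of the sampling strategy described above.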