Reinforcement learning (RL) has proven highly effective at eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs) to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, because the tasks VLMs face are inherently more complex: a VLM must first accurately perceive and understand its visual input before it can reason effectively. To address this challenge, we propose a two-stage reinforcement learning framework that jointly enhances the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing-advantage issue commonly observed in RL training, we first perform dataset-level sampling, using distinct data sources to selectively strengthen specific capabilities. During training, the first stage improves the model's visual perception through coarse- and fine-grained visual understanding, while the second stage enhances its reasoning abilities. This two-stage reinforcement learning process yields PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
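To make the vanishing-advantage point concrete: in group-relative RL objectives such as GRPO, a prompt whose sampled rollouts all receive the same reward yields zero group-normalized advantage for every sample, and hence no gradient. Below is a minimal Python sketch of dataset-level sampling under that assumption; the `generate` callable, the `is_correct` verifier, and the pass-rate thresholds are hypothetical illustrations, not the paper's actual procedure.

```python
from typing import Callable, List, Tuple

def is_correct(response: str, answer: str) -> bool:
    # Hypothetical binary verifier: exact match on the final answer string.
    return response.strip() == answer.strip()

def estimate_pass_rate(generate: Callable[[str], str], prompt: str,
                       answer: str, n_rollouts: int = 8) -> float:
    # Sample n_rollouts responses from the current policy and score each 0/1.
    wins = sum(is_correct(generate(prompt), answer) for _ in range(n_rollouts))
    return wins / n_rollouts

def filter_dataset(generate: Callable[[str], str],
                   dataset: List[Tuple[str, str]],
                   low: float = 0.1, high: float = 0.9) -> List[Tuple[str, str]]:
    # Keep only prompts with an intermediate pass rate. When all rollouts in
    # a group share the same reward (pass rate 0 or 1), the group-normalized
    # advantage is zero for every sample, so the prompt contributes no
    # gradient -- the vanishing-advantage failure mode this sampling avoids.
    return [(prompt, answer) for prompt, answer in dataset
            if low < estimate_pass_rate(generate, prompt, answer) < high]
```

In this sketch the same filtering could be run against perception-focused and reasoning-focused data sources separately, so each training stage draws only on prompts that still carry a learning signal for the capability it targets.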