While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
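To make the mechanism concrete, here is a minimal, hypothetical sketch of the core idea: a small set of learnable visual reasoning tokens that attend globally over image patch features and produce a task-adaptive re-encoding of the image. All names and hyperparameters (`VisualReasoningTokens`, `num_tokens`, the choice of cross-attention, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of learnable visual reasoning tokens (not the paper's code).
import torch
import torch.nn as nn


class VisualReasoningTokens(nn.Module):
    def __init__(self, num_tokens: int = 8, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Learnable token embeddings, discovered during fine-tuning
        # without explicit supervision on what they should encode.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # Global cross-attention: the tokens (queries) attend over all
        # image patch features (keys/values).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim) from the vision encoder.
        b = patch_feats.size(0)
        queries = self.tokens.unsqueeze(0).expand(b, -1, -1)
        # Each reasoning token aggregates task-relevant visual evidence
        # from the whole image (global attention).
        reencoded, _ = self.attn(queries, patch_feats, patch_feats)
        return self.norm(reencoded)  # (batch, num_tokens, dim)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 1024)  # dummy ViT patch features
    vrt = VisualReasoningTokens()
    print(vrt(feats).shape)            # torch.Size([2, 8, 1024])
```

In such a setup, the re-encoded tokens would be interleaved with the text tokens of the LMM so that the language backbone can condition its reasoning on them; the tokens themselves receive no direct supervision and are shaped only by the end-task loss.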