Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
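To make the Object-Centric Infusion idea concrete, the following is a minimal, hypothetical sketch of how compact object tokens could pre-absorb global scene context via cross-attention before being passed (together with text) to the LLM. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the Object-Centric Infusion idea: object tokens attend to
# global visual tokens once, so only the fused object tokens need to enter the LLM.
# Names and shapes are assumptions for illustration, not PixelRefer's released code.
import torch
import torch.nn as nn


class ObjectCentricInfusion(nn.Module):
    """Fuse global visual context into per-region object tokens (illustrative only)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_obj = nn.LayerNorm(dim)
        self.norm_ctx = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, object_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # object_tokens: (B, N_obj, D) compact region representations from an object tokenizer
        # global_tokens: (B, N_img, D) full-image / frame tokens from the vision encoder
        q = self.norm_obj(object_tokens)
        kv = self.norm_ctx(global_tokens)
        fused, _ = self.cross_attn(q, kv, kv)   # object queries read the global scene
        x = object_tokens + fused               # residual infusion of context
        return x + self.ffn(x)                  # position-wise refinement


if __name__ == "__main__":
    B, N_obj, N_img, D = 2, 4, 576, 1024
    infusion = ObjectCentricInfusion(dim=D)
    obj = torch.randn(B, N_obj, D)
    ctx = torch.randn(B, N_img, D)
    out = infusion(obj, ctx)                    # (B, N_obj, D): only these reach the LLM
    print(out.shape)
```

Under this reading, the efficiency gain of an object-only pipeline comes from replacing hundreds of global visual tokens in every LLM layer with a handful of context-enriched object tokens.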