PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
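
The idea of pre-fusing global context into object tokens can be illustrated with a toy example. The sketch below is not the PixelRefer-Lite implementation: it assumes the infusion step behaves like a single-head cross-attention with a residual connection, and the function names (`softmax`, `infuse_global_context`), token counts, and dimensions are all hypothetical choices made for illustration. It only shows why an object-only sequence is much shorter than the full sequence once global tokens have been absorbed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infuse_global_context(object_tokens, global_tokens):
    """Hypothetical infusion step (an assumption, not the paper's module).

    Each object token attends over all global tokens (scaled dot-product
    attention), and the attended context is added back to the object token.
    """
    dim = len(object_tokens[0])
    fused = []
    for obj in object_tokens:
        # Similarity between this object token and every global token.
        scores = [
            sum(o * g for o, g in zip(obj, glob)) / math.sqrt(dim)
            for glob in global_tokens
        ]
        weights = softmax(scores)
        # Weighted average of global tokens, computed per feature dimension.
        context = [
            sum(w * glob[i] for w, glob in zip(weights, global_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original object information.
        fused.append([o + c for o, c in zip(obj, context)])
    return fused


random.seed(0)
DIM = 8  # illustrative feature size
object_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
global_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(32)]

fused_tokens = infuse_global_context(object_tokens, global_tokens)

# Sequence length seen by the LLM with and without the global tokens.
full_sequence_length = len(global_tokens) + len(object_tokens)
lite_sequence_length = len(fused_tokens)
print(full_sequence_length, lite_sequence_length)  # prints "36 4"
```

In this toy setting the language model would process 4 fused tokens instead of 36, which is the kind of saving an object-only design aims for; the actual reduction depends on the real token counts, which the abstract does not state.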
