UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. By contrast, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where models are expected to achieve pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation independently, and they fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, then performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.

