To date, CLIP-style foundation models have largely been explored for retrieving short descriptions or for classifying the objects in a scene as a single-object image classification task. The same holds for retrieving image embeddings given a text prompt (image retrieval). However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. Recent CLIP-based methods improve class-level discrimination by mining harder negative image-text pairs and by refining fixed text prompts, often with LLMs; however, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. To further leverage this structure for scene analysis, the proposed ScenarioCLIP model takes as input texts, grounded relations, and images, along with focused regions that highlight relations. The model is pretrained on curated scenario data and finetuned for specialized downstream tasks such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing, publicly available indoor and outdoor scenario datasets. We use a pipeline of existing language models to ground actions, objects, and relations, followed by manual and automatic curation. We establish a comprehensive benchmark for several scenario-based tasks and compare against many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
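The abstract describes ScenarioCLIP as aligning whole images with captions while also aligning relation-focused regions with grounded relation texts. The snippet below is a minimal, hypothetical sketch of such a multi-stream contrastive objective; the class name, projection heads, feature dimensions, and equal loss weighting are assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ScenarioCLIPSketch(nn.Module):
    """Illustrative two-stream contrastive head (assumed, not the paper's code)."""

    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        # Stand-ins for projections on top of pretrained image/text backbones.
        self.image_proj = nn.Linear(feat_dim, embed_dim)
        self.region_proj = nn.Linear(feat_dim, embed_dim)
        self.text_proj = nn.Linear(feat_dim, embed_dim)
        self.relation_proj = nn.Linear(feat_dim, embed_dim)
        # Learnable temperature, initialized so exp(scale) ~ 1/0.07 as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def contrastive_loss(self, a, b):
        # Symmetric InfoNCE over a batch of paired embeddings.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = self.logit_scale.exp() * a @ b.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, image_feats, region_feats, text_feats, relation_feats):
        # Global alignment: whole image <-> full caption.
        global_loss = self.contrastive_loss(self.image_proj(image_feats),
                                            self.text_proj(text_feats))
        # Local alignment: relation-focused region <-> grounded relation text.
        local_loss = self.contrastive_loss(self.region_proj(region_feats),
                                           self.relation_proj(relation_feats))
        return global_loss + local_loss

# Toy usage with random backbone features (batch of 8, feature dim 768).
model = ScenarioCLIPSketch()
loss = model(*[torch.randn(8, 768) for _ in range(4)])
loss.backward()
```

In this sketch the global and local terms are simply summed; how the actual model weights or combines these alignment signals is not specified in the abstract.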