To date, CLIP-style foundation models have largely been explored for retrieving short descriptions or for classifying the objects in a scene as a single-object image classification task. The same holds for retrieving image embeddings given a text prompt (image retrieval). However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. Recent CLIP-based methods improve class-level discrimination by mining harder negative image-text pairs and by refining fixed text prompts, often with LLMs; however, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. To further leverage this structure for scene analysis, the proposed ScenarioCLIP model takes as input texts, grounded relations, and images, along with focused regions that highlight relations. The model is pretrained on curated scenario data and finetuned for specialized downstream tasks such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing, publicly available indoor and outdoor scenario datasets. We use a pipeline of existing language models to ground actions, objects, and relations, followed by manual and automatic curation. We establish a comprehensive benchmark for several scenario-based tasks and compare against many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
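The abstract describes ScenarioCLIP as aligning whole images with captions while also aligning relation-focused regions with grounded relation texts. The snippet below is a minimal, hypothetical sketch of such a multi-stream contrastive objective; the class name, projection heads, feature dimensions, and equal loss weighting are assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ScenarioCLIPSketch(nn.Module):
    """Illustrative two-stream contrastive head (assumed, not the paper's code)."""

    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        # Stand-ins for projections on top of pretrained image/text backbones.
        self.image_proj = nn.Linear(feat_dim, embed_dim)
        self.region_proj = nn.Linear(feat_dim, embed_dim)
        self.text_proj = nn.Linear(feat_dim, embed_dim)
        self.relation_proj = nn.Linear(feat_dim, embed_dim)
        # Learnable temperature, initialized so exp(scale) ~ 1/0.07 as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def contrastive_loss(self, a, b):
        # Symmetric InfoNCE over a batch of paired embeddings.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = self.logit_scale.exp() * a @ b.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, image_feats, region_feats, text_feats, relation_feats):
        # Global alignment: whole image <-> full caption.
        global_loss = self.contrastive_loss(self.image_proj(image_feats),
                                            self.text_proj(text_feats))
        # Local alignment: relation-focused region <-> grounded relation text.
        local_loss = self.contrastive_loss(self.region_proj(region_feats),
                                           self.relation_proj(relation_feats))
        return global_loss + local_loss

# Toy usage with random backbone features (batch of 8, feature dim 768).
model = ScenarioCLIPSketch()
loss = model(*[torch.randn(8, 768) for _ in range(4)])
loss.backward()
```

In this sketch the global and local terms are simply summed; how the actual model weights or combines these alignment signals is not specified in the abstract.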