Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, and therefore have broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of panoptic scene graph generation (PSG) in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task that picks out the qualified objects, and 2) a restricted auto-regressive text generation task that generates the relation between the given objects. In this work, we therefore introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieves 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
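The two-stage decoupling described in the abstract can be illustrated with a minimal sketch. It is not the ISRG implementation: it assumes the segmentation stage has already produced a list of object labels, it replaces the learned image-to-text decoder with a hypothetical toy scoring function (`toy_scores`), and the relation vocabulary `RELATIONS` is invented for illustration. The sketch only shows one plausible reading of "restricted" auto-regressive generation, in which each decoding step may choose only tokens that keep the output a prefix of an allowed relation phrase.

```python
END = "<end>"

# Hypothetical closed set of relation phrases (illustrative only,
# not the actual OpenPSG predicate list).
RELATIONS = ["sitting on", "standing on", "holding", "looking at"]


def build_trie(phrases):
    """Build a token-level prefix tree; END marks a complete phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def restricted_greedy_decode(score_fn, trie):
    """Greedy decoding in which each step may only pick a trie-allowed token."""
    node, out = trie, []
    while True:
        allowed = sorted(node)           # tokens that keep the output valid
        scores = score_fn(out, allowed)  # decoder scores for allowed tokens
        best = max(allowed, key=lambda t: scores[t])
        if best == END:
            return " ".join(out)
        out.append(best)
        node = node[best]


def toy_scores(subject, obj):
    """Hypothetical stand-in for a learned image-to-text decoder."""
    prefer = {
        ("person", "bench"): ["sitting", "on"],
        ("person", "cup"): ["holding"],
    }
    target = prefer.get((subject, obj), ["looking", "at"])

    def score_fn(prefix, allowed):
        nxt = target[len(prefix)] if len(prefix) < len(target) else END
        return {t: (1.0 if t == nxt else 0.0) for t in allowed}

    return score_fn


def generate_triples(objects, trie):
    """Stage 2: generate a relation for every ordered pair of objects."""
    triples = []
    for s in objects:
        for o in objects:
            if s != o:
                relation = restricted_greedy_decode(toy_scores(s, o), trie)
                triples.append((s, relation, o))
    return triples


if __name__ == "__main__":
    # Stage 1 (segmentation) is assumed to have produced these labels.
    for triple in generate_triples(["person", "bench"], build_trie(RELATIONS)):
        print(triple)
```

Because every step is limited to the children of the current trie node, the decoder can only emit a phrase from the allowed set, whatever scores the underlying model assigns. A real system would replace `toy_scores` with the decoder's token probabilities conditioned on the image and the selected object pair.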