Object detection, scene graph generation, and region captioning are three scene understanding tasks at different semantic levels, and they are closely tied: scene graphs are generated on top of objects detected in an image, with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations, and other contextual information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed Multi-level Scene Description Network (MSDN), that solves the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned in a dynamic graph based on their spatial and semantic connections. A feature refining structure then passes messages across the three semantic levels through this graph. We benchmark the learned model on the three tasks and show that joint learning with our proposed method brings mutual improvements over previous models. In particular, on the scene graph generation task, our method outperforms the state-of-the-art by a margin of more than 3%.
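The cross-level feature refinement can be illustrated with a minimal sketch: features at the phrase level aggregate messages from connected object and region nodes, and vice versa, using the alignment graph's adjacency. This is a simplified illustration with NumPy, not the paper's exact gating scheme; the function and matrix names (`refine_features`, `obj2phr`, `phr2reg`) are hypothetical.

```python
import numpy as np

def _row_normalize(adj):
    """Average incoming messages: divide each row by its degree (safe for empty rows)."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(deg, 1.0)

def refine_features(obj_feats, phr_feats, reg_feats, obj2phr, phr2reg, alpha=0.5):
    """One message-passing step across the three semantic levels.

    obj2phr[i, j] = 1 if object i participates in phrase j;
    phr2reg[j, k] = 1 if phrase j falls inside caption region k.
    alpha is an assumed mixing weight (the paper learns its merging scheme).
    """
    # Phrases aggregate from connected objects (below) and regions (above).
    phr_from_obj = _row_normalize(obj2phr.T) @ obj_feats
    phr_from_reg = _row_normalize(phr2reg) @ reg_feats
    new_phr = phr_feats + alpha * (phr_from_obj + phr_from_reg)
    # Objects and regions each receive messages back from connected phrases.
    new_obj = obj_feats + alpha * (_row_normalize(obj2phr) @ phr_feats)
    new_reg = reg_feats + alpha * (_row_normalize(phr2reg.T) @ phr_feats)
    return new_obj, new_phr, new_reg
```

In the actual model the updates are computed with learned gates inside the CNN, but the sketch captures the key idea: the dynamic graph routes information so that each task's representation is refined by the other two semantic levels.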