Visual relationship detection can bridge the gap between computer vision and natural language for scene understanding of images. Unlike pure object recognition tasks, subject-predicate-object relation triplets span an extremely diverse space, e.g., \textit{person-behind-person} and \textit{car-behind-building}, and suffer from combinatorial explosion. In this paper, we propose a context-dependent diffusion network (CDDN) framework for visual relationship detection. To capture the interactions among different object instances, we construct two types of graphs, a word semantic graph and a visual scene graph, to encode global contextual interdependency. The semantic graph is built from language priors to model semantic correlations across objects, whilst the visual scene graph defines the connections among scene objects so as to exploit surrounding scene information. For these graph-structured data, we design a diffusion network that adaptively aggregates information from contexts; it effectively learns latent representations of visual relationships and, owing to its isomorphism invariance on graphs, is well suited to visual relationship detection. Experiments on two widely used datasets demonstrate that our proposed method is effective and achieves state-of-the-art performance.
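To make the context-aggregation idea concrete, the sketch below shows one plausible reading of diffusion over the two graphs: node features are propagated over symmetrically normalized adjacency matrices of a semantic graph and a scene graph, and the two propagated signals are fused. The equal-weight fusion, the number of diffusion steps, and all function names are illustrative assumptions, not the paper's exact CDDN formulation.

```python
import numpy as np

def normalize_adj(A):
    # Symmetrically normalize adjacency with self-loops:
    # D^{-1/2} (A + I) D^{-1/2}, a standard graph-diffusion operator.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def diffuse(X, A_sem, A_vis, steps=2):
    # Aggregate context from both the semantic and the scene graph.
    # Averaging the two propagated signals is an illustrative fusion
    # choice (assumption), not necessarily the paper's learned rule.
    S, V = normalize_adj(A_sem), normalize_adj(A_vis)
    H = X
    for _ in range(steps):
        H = 0.5 * (S @ H) + 0.5 * (V @ H)
    return H
```

Because the propagation depends only on graph connectivity (through the normalized adjacency), relabeling the nodes permutes the output rows in the same way, which is the isomorphism-invariance property the abstract appeals to.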