Relational Reasoning using Prior Knowledge for Visual Captioning

Exploiting relationships among objects has led to remarkable progress in describing images and videos in natural language. Most existing methods first detect objects and their relationships and then generate textual descriptions. This approach depends heavily on pre-trained detectors, and performance drops in the face of heavy occlusion, tiny objects, and long-tail distributions in object detection. In addition, separating detection from captioning results in semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors and to reach semantic coherency within one image or video in captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning, which embeds image or video regions into a semantic space to build a semantic graph, and 2) relational reasoning, which encodes the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
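
The two reasoning steps named in the abstract can be illustrated with a toy sketch. This is not the authors' model: the abstract gives no architectural details, so everything here is an assumption made for illustration, including the function names (`build_semantic_graph`, `relational_step`), the three-dimensional "semantic space", the hand-written concept vectors, and the tiny prior-knowledge table. Step 1 assigns each region feature to its nearest concept and connects regions whose concepts are related in the prior knowledge; step 2 runs one round of neighbor averaging over the resulting graph.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_semantic_graph(regions, concepts, prior):
    """Toy 'commonsense reasoning' step (hypothetical).

    Labels each region with its most similar concept, then adds an edge
    between two regions when the prior knowledge relates their concepts.
    """
    labels = [max(concepts, key=lambda name: cosine(r, concepts[name]))
              for r in regions]
    n = len(regions)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Treat the prior as symmetric: look up both orderings.
                pair, rev = (labels[i], labels[j]), (labels[j], labels[i])
                adj[i][j] = prior.get(pair, prior.get(rev, 0.0))
    return labels, adj

def relational_step(regions, adj):
    """Toy 'relational reasoning' step (hypothetical).

    One round of message passing: each node becomes the weighted average
    of itself (weight 1) and its graph neighbors (edge weights).
    """
    encoded = []
    for i, h in enumerate(regions):
        total = 1.0 + sum(adj[i])
        mixed = list(h)
        for j, w in enumerate(adj[i]):
            if w:
                mixed = [m + w * x for m, x in zip(mixed, regions[j])]
        encoded.append([m / total for m in mixed])
    return encoded

# Made-up data: three concepts, one prior relation, three region features.
CONCEPTS = {"person": [1.0, 0.0, 0.0],
            "horse": [0.0, 1.0, 0.0],
            "sky": [0.0, 0.0, 1.0]}
PRIOR = {("person", "horse"): 1.0}  # e.g. "a person rides a horse"
REGIONS = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]

labels, adj = build_semantic_graph(REGIONS, CONCEPTS, PRIOR)
encoded = relational_step(REGIONS, adj)
```

In this toy setting the "person" and "horse" regions exchange information because the prior relates them, while the "sky" region has no related neighbor and keeps its original feature. A real system would learn the embedding and the graph encoder and would feed the encoded nodes to a sentence decoder, none of which is shown here.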
