Traditional scene graph generation methods are trained with cross-entropy losses that treat objects and relationships as independent entities. Such a formulation, however, ignores the structure of the output space in what is inherently a structured prediction problem. In this work, we introduce a novel energy-based learning framework for generating scene graphs. The proposed formulation efficiently incorporates the structure of scene graphs into the output space. This additional constraint in the learning framework acts as an inductive bias and allows models to learn efficiently from a small number of labels. We use the proposed energy-based framework to train existing state-of-the-art models and obtain significant performance improvements of up to 21% and 27% on the Visual Genome and GQA benchmark datasets, respectively. Furthermore, we showcase the learning efficiency of the proposed framework by demonstrating superior performance in the zero- and few-shot settings, where data is scarce.