Rethinking the Evaluation of Unbiased Scene Graph Generation

Owing to the severely imbalanced predicate distributions in common subject-object relations, current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones. To improve the robustness of SGG models across predicate categories, recent research has focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation metric. However, we identify two overlooked issues with this de facto standard metric, which make current unbiased SGG evaluation vulnerable and unfair: 1) mR@K neglects the correlations among predicates and unintentionally breaks category independence by ranking all triplet predictions together regardless of their predicate categories, so the performance on some predicates is underestimated. 2) mR@K neglects the compositional diversity of different predicates and assigns excessively high weights to samples of overly simple categories that have only a limited number of composable relation triplet types. This conflicts with the goal of the SGG task, which encourages models to detect more types of visual relationship triplets. In addition, we investigate the under-explored correlation between objects and predicates, which can serve as a simple but strong baseline for unbiased SGG. In this paper, we refine mR@K and propose two complementary evaluation metrics for unbiased SGG: Independent Mean Recall (IMR) and weighted IMR (wIMR). These two metrics are designed to account for category independence and for the diversity of composable relation triplets, respectively. We compare the proposed metrics with the de facto standard metrics through extensive experiments and discuss how to evaluate unbiased SGG in a more trustworthy way.
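To make the first issue concrete, the following is a minimal sketch of mR@K as it is commonly computed, not the authors' evaluation code. The function name `mean_recall_at_k`, the input layout, and the toy data are assumptions made for illustration. Matching is simplified to exact label triplets, with no bounding-box IoU test. The sketch does not implement IMR or wIMR, whose definitions are not given in the abstract.

```python
from collections import defaultdict


def mean_recall_at_k(images, k):
    """Simplified mR@K (hypothetical helper, label-only matching).

    images: list of (gt_triplets, predictions) pairs, one per image.
      gt_triplets: list of (subject, predicate, object) tuples.
      predictions: list of (score, (subject, predicate, object)) tuples.
    Returns (mean recall over predicates, per-predicate recall dict).
    """
    per_predicate = defaultdict(list)  # predicate -> per-image recalls
    for gt_triplets, predictions in images:
        # All predictions are ranked together, regardless of predicate.
        ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
        top_k = {triplet for _, triplet in ranked[:k]}

        hits = defaultdict(int)
        totals = defaultdict(int)
        for triplet in gt_triplets:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1

        # Recall of each predicate that appears in this image.
        for predicate, n in totals.items():
            per_predicate[predicate].append(hits[predicate] / n)

    # Average over images per predicate, then over predicates.
    recalls = {p: sum(v) / len(v) for p, v in per_predicate.items()}
    return sum(recalls.values()) / len(recalls), recalls


# Toy image (made-up data): a frequent predicate "on" outscores the rarer "riding".
example = [(
    [("man", "on", "horse"), ("man", "riding", "horse")],
    [
        (0.9, ("man", "on", "horse")),
        (0.8, ("man", "on", "street")),
        (0.6, ("man", "riding", "horse")),
    ],
)]

print(mean_recall_at_k(example, k=2))
print(mean_recall_at_k(example, k=3))
```

In this toy case the correct `riding` triplet is pushed out of the top 2 by two higher-scoring `on` predictions, so the recall for `riding` is 0 and mR@2 is 0.5, while mR@3 is 1.0. The score of one predicate therefore depends on how confidently the model predicts other predicates, which is the loss of category independence that the abstract describes.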
