Ranking methods or models by performance is of prime importance but tricky, because performance is fundamentally multidimensional. In classification, precision and recall are complementary scores with probabilistic interpretations, and both are important to consider. The rankings they induce are often in partial contradiction, so in practice it is extremely useful to establish a compromise between the two views and obtain a single, global ranking. For roughly the last fifty years, the standard proposal has been a weighted harmonic mean, known as the F-score, F-measure, or $F_\beta$. Generally speaking, averaging basic scores yields a score that is intermediate in value; however, there is no guarantee that such a score induces a meaningful ranking, nor that the ranking is a good tradeoff between the base scores. Given the ubiquity of $F_\beta$ scores in the literature, some clarification is in order. Concretely: (1) We establish that $F_\beta$-induced rankings are meaningful and define a shortest path between the precision- and recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations, and we show that $F_1$ and its skew-insensitive version are far from optimal in that regard. (3) We provide theoretical tools and a closed-form expression for the optimal value of $\beta$ for any distribution or set of performances, and we illustrate their use on six case studies.
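As a minimal illustration of the phenomenon the abstract describes, the sketch below computes $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$ for two hypothetical models whose precision- and recall-induced rankings disagree, and shows how varying $\beta$ moves the induced ordering between the two views. The model scores are invented for illustration only.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (the F_beta score).
    beta > 1 weights recall more heavily; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical models whose precision/recall rankings disagree:
model_a = (0.90, 0.60)  # higher precision, lower recall
model_b = (0.70, 0.80)  # lower precision, higher recall

# With a small beta the F_beta ranking follows precision (A wins);
# with a large beta it follows recall (B wins). F_1 sits in between.
a_wins_low = f_beta(*model_a, beta=0.2) > f_beta(*model_b, beta=0.2)   # True
a_wins_high = f_beta(*model_a, beta=5.0) > f_beta(*model_b, beta=5.0)  # False
```

This is exactly the sense in which a family of $F_\beta$ rankings interpolates between the precision-induced and recall-induced orderings.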
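The optimization framing in point (2) can be sketched with a toy objective: choose $\beta$ so that the $F_\beta$-induced ranking is equally correlated, in the Kendall sense, with the precision- and recall-induced rankings. This balanced-correlation criterion and the grid search below are illustrative assumptions, not the paper's exact formulation (which gives a closed form), and the model pool is invented.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists (no tie handling)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def f_beta(p, r, beta):
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r) if (p, r) != (0.0, 0.0) else 0.0

# Hypothetical pool of (precision, recall) performances:
models = [(0.9, 0.55), (0.8, 0.65), (0.6, 0.9), (0.75, 0.7), (0.5, 0.95)]
precisions = [p for p, _ in models]
recalls = [r for _, r in models]

def imbalance(beta):
    """How far the F_beta ranking is from being equally Kendall-correlated
    with the precision and recall rankings (0 = perfectly balanced)."""
    f = [f_beta(p, r, beta) for p, r in models]
    return abs(kendall_tau(f, precisions) - kendall_tau(f, recalls))

# Crude grid search over beta in (0.1, 5); the paper's contribution is a
# closed-form expression that avoids this search entirely.
best_gap, best_beta = min((imbalance(b / 100), b / 100) for b in range(10, 500))
```

The point of the sketch is only to make the objective concrete: "find the $\beta$ whose ranking best trades off the two base rankings" is an optimization over Kendall correlations, for which the abstract claims a closed-form solution.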