Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high-variance estimators, while metrics are biased, low-variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human-predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to that of humans and of a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage .
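To make the pairwise comparison concrete, the sketch below (not the authors' released code; see the repository above for that) estimates the pairwise prediction error of an estimator with the bootstrap, using synthetic segment-level scores. All names and numbers (`human_a`, `metric_a`, the score distributions) are illustrative assumptions; the true ranking is fixed by construction rather than derived from human labels as in the paper.

```python
# Minimal sketch: bootstrap estimate of pairwise prediction error
# ("which system is better?"). Synthetic data only; illustrative names.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pairwise_error(scores_a, scores_b, true_label, n_boot=1000):
    """Fraction of bootstrap resamples in which the estimator's
    system-level ranking disagrees with the assumed true ranking."""
    n = len(scores_a)
    errors = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample test segments
        pred = np.sign(scores_a[idx].mean() - scores_b[idx].mean())
        errors += (pred != true_label)
    return errors / n_boot

# Synthetic example: system A is truly slightly better than system B.
n_segments = 500
human_a  = rng.normal(0.52, 1.0, n_segments)          # unbiased, high-variance judgments
human_b  = rng.normal(0.50, 1.0, n_segments)
metric_a = rng.normal(0.53, 0.2, n_segments)          # biased, low-variance metric scores
metric_b = rng.normal(0.50, 0.2, n_segments)

true_label = 1  # ground truth by construction: A > B
print("human error :", bootstrap_pairwise_error(human_a, human_b, true_label))
print("metric error:", bootstrap_pairwise_error(metric_a, metric_b, true_label))
```

Under these assumed distributions, the small quality gap and limited number of judgments mean the low-variance metric typically flips the ranking less often than the noisy human scores, which is the statistical advantage the abstract describes.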