Recent developments in machine learning have enabled the development of complex multivariate probabilistic forecasting models. It is therefore pivotal to have a precise evaluation method to gauge the performance and predictive power of these complex methods. To this end, several evaluation metrics have been proposed in the past (such as the energy score, the Dawid-Sebastiani score, and the variogram score); however, they cannot reliably measure the performance of a probabilistic forecaster. Recently, CRPS-Sum has gained prominence as a reliable metric for multivariate probabilistic forecasting. This paper presents a systematic evaluation of CRPS-Sum to understand its discrimination ability. We show that the statistical properties of the target data affect the discrimination ability of CRPS-Sum. Furthermore, we highlight that the CRPS-Sum calculation overlooks the model's performance on each dimension. These flaws can lead to an incorrect assessment of model performance. Finally, with experiments on real-world datasets, we demonstrate that the shortcomings of CRPS-Sum yield a misleading indication of a probabilistic forecasting method's performance. We show that a dummy model, whose forecasts look like random noise, can easily achieve a better CRPS-Sum than a state-of-the-art method.
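To make the per-dimension blind spot concrete, the following is a minimal sketch of the commonly used sample-based CRPS estimator, CRPS(F, y) ≈ E|X − y| − ½ E|X − X′|, and of CRPS-Sum, which sums the forecast and target across dimensions before computing CRPS. The function names and array layout (`n_samples × n_timesteps × n_dims`) are illustrative assumptions, not the paper's implementation; the aggregation step shows how per-dimension errors of opposite sign can cancel in the summed series.

```python
import numpy as np

def crps_sample(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: 1-D array of forecast samples for one scalar target y.
    """
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def crps_sum(forecast_samples, target):
    """CRPS-Sum: sum across dimensions, then average CRPS over time.

    forecast_samples: array of shape (n_samples, n_timesteps, n_dims).
    target:           array of shape (n_timesteps, n_dims).
    """
    agg_samples = forecast_samples.sum(axis=-1)  # (n_samples, n_timesteps)
    agg_target = target.sum(axis=-1)             # (n_timesteps,)
    return np.mean([crps_sample(agg_samples[:, t], agg_target[t])
                    for t in range(agg_target.shape[0])])
```

Because only the cross-dimension sum enters the CRPS, a forecaster that overshoots one dimension and undershoots another by the same amount scores a CRPS-Sum of zero, even though both marginal forecasts are wrong.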