Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with high probability. Although prediction set size is expected to capture aleatoric uncertainty, evidence of its effectiveness is lacking. The literature suggests that prediction set size can upper-bound aleatoric uncertainty, or that prediction sets are larger for difficult instances and smaller for easy ones, but this attribute of conformal predictors has not been validated. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We do so by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of conformal prediction outputs exhibit very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated by conformal predictors. While they can provide high coverage of the true classes, their ability to capture aleatoric uncertainty and to generate sets that align with human annotations remains limited.
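The evaluation described above can be sketched in a minimal form: build split-conformal prediction sets from held-out calibration scores, then correlate the resulting set sizes with the number of distinct labels human annotators assigned per instance. The sketch below is illustrative only; the softmax scores, annotator simulation, class count, and the simple 1 − p score function are assumptions, not the paper's actual models, datasets, or conformal methods.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical setup: synthetic softmax outputs for a 4-class problem,
# standing in for a trained deep learning model's predicted probabilities.
n_cal, n_test, K = 500, 200, 4

def fake_softmax(n):
    z = rng.normal(scale=2.0, size=(n, K))
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

cal_probs, test_probs = fake_softmax(n_cal), fake_softmax(n_test)
cal_labels = np.array([rng.choice(K, p=p) for p in cal_probs])

# Split conformal prediction with the common score s(x, y) = 1 - p_y:
# calibrate a threshold so sets cover the true class with prob. >= 1 - alpha.
alpha = 0.1
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

# Prediction set: every class whose score falls below the threshold.
pred_sets = test_probs >= (1.0 - qhat)
set_sizes = pred_sets.sum(axis=1)

# Hypothetical annotations: five simulated annotators label each test
# instance; the count of distinct labels proxies class overlap / ambiguity.
annot = np.array([[rng.choice(K, p=p) for _ in range(5)] for p in test_probs])
n_distinct = np.array([len(set(a)) for a in annot])

# Rank correlation between set size and annotator disagreement.
rho, pval = spearmanr(set_sizes, n_distinct)
print(f"Spearman rho = {rho:.3f}")
```

A weak correlation here would mirror the paper's finding: valid marginal coverage does not imply that set sizes track human-perceived ambiguity.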