How precise are performance estimates for typical medical image segmentation tasks?

An important issue in medical image processing is estimating not only the performance of algorithms but also the precision of those performance estimates. Reporting precision typically amounts to reporting the standard error of the mean (SEM) or, equivalently, confidence intervals. However, this is rarely done in medical image segmentation studies. In this paper, we aim to estimate the typical confidence that can be expected in such studies. To that end, we first perform experiments on Dice metric estimation using a standard deep learning model (U-Net) and a classical task from the Medical Segmentation Decathlon. We extensively study precision estimation using both a Gaussian assumption and bootstrapping (which does not require any assumption on the distribution). We then perform simulations for other test set sizes and performance spreads. Overall, our work shows that small test sets lead to wide confidence intervals ($\sim$6 Dice points for 20 samples) and that obtaining a confidence interval narrower than 2 Dice points requires at least 200 test samples.
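
The two precision estimates named in the abstract can be illustrated with a minimal sketch. It assumes a 95% confidence level, a percentile bootstrap, and synthetic Dice scores drawn from a clipped normal distribution with an arbitrary mean of 85 and spread of 7 points; none of these choices comes from the paper, and the function names `gaussian_ci` and `bootstrap_ci` are hypothetical. Under the Gaussian assumption, the SEM is the sample standard deviation divided by $\sqrt{n}$, and the interval is the mean plus or minus 1.96 SEM.

```python
import numpy as np


def gaussian_ci(scores, z=1.96):
    """Mean, SEM, and CI under a Gaussian assumption (z=1.96 assumes a 95% level)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # SEM = sample standard deviation / sqrt(n)
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return mean, sem, (mean - z * sem, mean + z * sem)


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean; no distributional assumption."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the test set with replacement, n_resamples times.
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Synthetic Dice scores in points (0-100). The mean and spread below are
    # illustrative assumptions, not values reported in the paper.
    rng = np.random.default_rng(42)
    for n in (20, 200):
        dice = np.clip(rng.normal(85.0, 7.0, size=n), 0.0, 100.0)
        mean, sem, (g_lo, g_hi) = gaussian_ci(dice)
        b_lo, b_hi = bootstrap_ci(dice)
        print(f"n={n}: mean={mean:.2f}, SEM={sem:.2f}, "
              f"Gaussian CI width={g_hi - g_lo:.2f}, "
              f"bootstrap CI width={b_hi - b_lo:.2f}")
```

Because the SEM shrinks with $\sqrt{n}$, a tenfold increase in test set size narrows the interval by roughly a factor of three, which is the scaling behind the gap between the 20-sample and 200-sample cases.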
