The Pitfalls of Sample Selection: A Case Study on Lung Nodule Classification

Vasileios Baltatzis, Kyriaki-Margarita Bintsi, Loic Le Folgoc, Octavio E. Martinez Manzanera, Sam Ellis, Arjun Nair, Sujal Desai, Ben Glocker, Julia A. Schnabel

From arXiv. Accepted at PRIME, MICCAI 2021.

Using publicly available data to evaluate methodological contributions is important, as it facilitates reproducibility and allows scrutiny of published results. In lung nodule classification, for example, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and an assessment of the impact of individual contributions. When analyzing seven recent works, however, we find that each employs a different data selection process, leading to widely varying total numbers of samples and ratios between benign and malignant cases. As each subset has different characteristics and a different level of classification difficulty, a direct comparison between the proposed methods is not always possible, nor fair. We study the particular effect of truthing when aggregating labels from multiple experts. We show that specific choices can have a severe impact on the data distribution, such that it may be possible to achieve superior performance on one sample distribution but not on another. While we show that we can further improve on the state of the art on one sample selection, we also find that on a more challenging sample selection from the same database, the more advanced models underperform very simple baseline methods, highlighting that the selected data distribution may play an even more important role than the model architecture. This raises concerns about the validity of claimed methodological contributions. We believe the community should be aware of these pitfalls and make recommendations on how they can be avoided in future work.
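
To make the truthing effect concrete, the following is a minimal sketch and not the authors' procedure. It assumes each nodule carries several expert malignancy ratings on a 1 to 5 scale, that a score of 3 is treated as ambiguous, and that studies differ in the aggregation function (mean or median) and in whether ambiguous nodules are dropped. The ratings, the threshold, and the names `truth`, `select`, and `summarize` are all hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical expert ratings per nodule (assumed scale: 1 = benign ... 5 = malignant).
# These values are invented for illustration; they are not taken from LIDC.
ratings = {
    "nodule_a": [1, 2, 2, 1],
    "nodule_b": [3, 3, 4, 2],
    "nodule_c": [5, 4, 5, 4],
    "nodule_d": [2, 3, 3, 5],
    "nodule_e": [3, 3, 3],
    "nodule_f": [1, 3, 3, 4],
}


def truth(scores, aggregate, drop_ambiguous=True):
    """Aggregate one nodule's ratings into 'benign', 'malignant', or None (excluded)."""
    score = aggregate(scores)
    if score < 3:
        return "benign"
    if score > 3:
        return "malignant"
    # Aggregated score is exactly 3: either exclude the nodule or
    # (assumed alternative rule) count it as malignant.
    return None if drop_ambiguous else "malignant"


def select(ratings, aggregate, drop_ambiguous=True):
    """Apply one truthing rule to every nodule and keep only the labelled ones."""
    labels = {}
    for nodule_id, scores in ratings.items():
        label = truth(scores, aggregate, drop_ambiguous)
        if label is not None:
            labels[nodule_id] = label
    return labels


def summarize(labels):
    """Return (total, benign, malignant) counts for a selection."""
    malignant = sum(1 for v in labels.values() if v == "malignant")
    return len(labels), len(labels) - malignant, malignant


if __name__ == "__main__":
    for name, agg in (("mean", mean), ("median", median)):
        for drop in (True, False):
            total, benign, malignant = summarize(select(ratings, agg, drop))
            print(f"{name:6s} drop_ambiguous={drop!s:5s} "
                  f"total={total} benign={benign} malignant={malignant}")
```

On this toy data, the four rule combinations give four different selections from the same six nodules, ranging from two samples with a 1:1 class ratio to six samples with a 1:5 ratio. A classifier evaluated on one of these subsets is therefore not directly comparable with one evaluated on another.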
