It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of three facts: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground-truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground-truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than competing methods, reducing error by 5.1x relative to using labeled data alone and by 2.4x relative to the next-best competing method. SSME also yields more accurate estimates when evaluating performance on subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating language models.
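The core mechanism described above, a semi-supervised mixture model over classifier scores that is then queried for any metric of scores and labels, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it models each class's score distribution as a Gaussian fit by EM (SSME's actual density family may differ), assumes at least a few labeled examples per class for initialization, and all names (fit_ssme, scores_l, scores_u, y_l) are hypothetical. In SSME's setting, each score vector would concatenate the per-class scores of the available classifiers.

```python
# Minimal sketch of the idea behind SSME. Assumptions (not from the abstract):
# Gaussian class-conditional densities fit by EM; the paper's actual mixture
# family may differ. All names here are hypothetical.
import numpy as np
from scipy.stats import multivariate_normal

def fit_ssme(scores_l, y_l, scores_u, n_classes, n_iter=100):
    """Fit a class-conditional mixture over classifier scores using both
    labeled (scores_l, y_l) and unlabeled (scores_u) data."""
    X = np.vstack([scores_l, scores_u])
    n_l, d = len(y_l), X.shape[1]
    # Initialize each class component from the labeled examples of that class.
    pi = np.array([(y_l == k).mean() for k in range(n_classes)])
    mu = np.array([scores_l[y_l == k].mean(axis=0) for k in range(n_classes)])
    cov = np.array([np.cov(scores_l[y_l == k].T) + 1e-6 * np.eye(d)
                    for k in range(n_classes)])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each class for each point.
        log_r = np.stack([np.log(pi[k] + 1e-12)
                          + multivariate_normal.logpdf(X, mu[k], cov[k])
                          for k in range(n_classes)], axis=1)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        r[:n_l] = np.eye(n_classes)[y_l]  # labeled points keep their true class
        # M-step: update mixture weights, means, and covariances.
        Nk = r.sum(axis=0)
        pi = Nk / len(X)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(n_classes):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, cov, r

# Once fit, the responsibilities r act as soft ground-truth labels, so any
# metric of (scores, labels) can be estimated on the unlabeled data, e.g.:
#   preds = scores_u.argmax(axis=1)                    # classifier predictions
#   acc_hat = r[len(y_l):][np.arange(len(preds)), preds].mean()
```

The clamping step is what makes the EM semi-supervised: labeled points contribute their known class with certainty, while unlabeled points contribute fractionally through their posterior responsibilities.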