State-of-the-art machine learning often follows a two-stage process: $(i)$~pre-training on large, general-purpose datasets; $(ii)$~fine-tuning on task-specific data. In fine-tuning, selecting training examples that closely reflect the target distribution is crucial. However, often only a few samples from the target distribution are available. Existing data selection methods treat these target samples as a validation set and estimate the effect of adding or removing a single sample from the training pool by performing inference on the validation set. We propose a simpler and faster alternative that inverts the usual roles of the training and validation sets: we perform inference on the training pool before and after fine-tuning on the validation set, and then select the samples whose predictions change the most. Our key insight is that the training samples most affected by fine-tuning on a small validation set tend to be the most beneficial for reducing test loss on the target distribution. Experiments on instruction tuning and named entity recognition tasks show that, in most cases, our method achieves lower test log-loss than state-of-the-art approaches. We support our findings with theoretical analysis.
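To make the selection rule concrete, the following is a minimal PyTorch-style sketch under assumed interfaces: the callables `finetune` and `per_example_loss`, and the function name `select_training_samples`, are hypothetical placeholders rather than the authors' implementation, and the change in per-example loss is used here as one possible stand-in for "prediction change." It scores every candidate in the training pool by how much its output shifts after briefly fine-tuning a copy of the model on the small validation set, then keeps the top-$k$ most affected samples.

```python
# Minimal sketch of the selection rule described above (assumed PyTorch-style
# interfaces; `finetune` and `per_example_loss` are hypothetical placeholders,
# not the authors' released code).
import copy
import torch

def select_training_samples(model, train_pool, val_set, k, finetune, per_example_loss):
    """Return indices of the k training-pool samples whose per-example loss
    changes the most after fine-tuning a copy of `model` on `val_set`."""
    # 1) Loss on each training-pool sample *before* touching the validation set.
    with torch.no_grad():
        loss_before = torch.stack([per_example_loss(model, x) for x in train_pool])

    # 2) Fine-tune a copy of the model on the few target (validation) samples.
    tuned = copy.deepcopy(model)
    finetune(tuned, val_set)  # e.g. a handful of gradient steps

    # 3) Loss on the same training-pool samples *after* fine-tuning.
    with torch.no_grad():
        loss_after = torch.stack([per_example_loss(tuned, x) for x in train_pool])

    # 4) Keep the samples whose predictions moved the most.
    change = (loss_after - loss_before).abs()
    return torch.topk(change, k).indices.tolist()
```

The key design point, as described above, is the inverted train/validation role: inference runs over the (large) training pool, while gradient updates touch only the (small) validation set, which is what makes the procedure cheap relative to influence-style estimates that re-evaluate the validation set for each candidate sample.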