In Natural Language Processing (NLP), finding data augmentation techniques that can produce high-quality, human-interpretable examples has always been challenging. Recently, retrieving augmented examples from large repositories of unlabelled sentences via kNN has marked a step toward interpretable augmentation. Inspired by this paradigm, we introduce Minimax-kNN, a sample-efficient data augmentation strategy tailored for Knowledge Distillation (KD). We exploit a semi-supervised approach based on KD to train a model on augmented data. In contrast to existing kNN augmentation techniques that blindly incorporate all samples, our method dynamically selects a subset of augmented samples that maximizes the KL divergence between the teacher and student models. This step aims to extract the most useful samples, ensuring that the augmented data covers regions of the input space with the highest loss. We evaluated our technique on several text classification tasks and demonstrated that Minimax-kNN consistently outperforms strong baselines. Our results show that Minimax-kNN requires fewer augmented examples and less computation to achieve superior performance over state-of-the-art kNN-based augmentation techniques.
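To make the selection step more concrete, the sketch below illustrates, in PyTorch, how kNN-retrieved augmented candidates could be ranked by teacher-student KL divergence and the hardest k retained. This is a minimal illustration of the idea only, not the paper's implementation: the function name, its signature, the temperature parameter, and the use of pre-encoded sentence vectors as inputs are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def select_minimax_candidates(teacher, student, candidates, k, temperature=1.0):
    """Rank kNN-retrieved augmented candidates by teacher-student KL divergence
    and return the indices of the k candidates on which the student disagrees
    most with the teacher (the maximization step of a minimax-style objective).
    Names and signature are illustrative, not taken from the paper."""
    with torch.no_grad():
        t_probs = F.softmax(teacher(candidates) / temperature, dim=-1)
        s_log_probs = F.log_softmax(student(candidates) / temperature, dim=-1)
        # Per-candidate KL(teacher || student): sum over the class dimension.
        kl = F.kl_div(s_log_probs, t_probs, reduction="none").sum(dim=-1)
    return torch.topk(kl, k=min(k, kl.numel())).indices


if __name__ == "__main__":
    # Toy usage with stand-in linear classifiers over pre-encoded sentence vectors.
    teacher = nn.Linear(16, 3)
    student = nn.Linear(16, 3)
    candidates = torch.randn(8, 16)  # 8 kNN-retrieved neighbour encodings
    chosen = select_minimax_candidates(teacher, student, candidates, k=2)
    print(chosen)  # indices of the 2 candidates with the largest teacher-student divergence
```

In the full training loop described in the abstract, the student would presumably then be trained to minimize the KD loss on the selected candidates together with the labelled data, closing the minimize-maximize loop.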