Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers

In many areas of machine learning, research suggests that the development of training data can matter more than the choice and modelling of the classifiers themselves. Data augmentation methods have therefore been developed to improve classifiers with artificially created training data. In NLP, the challenge is to establish universal rules for text transformations that provide new linguistic patterns. In this paper, we present and evaluate a text generation method suited to increasing the performance of classifiers for long and short texts. Enhancing training with our text generation method yields promising improvements on both short and long text tasks. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no-augmentation baseline and another data augmentation technique, respectively. Because such constructed regimes are not universally applicable, we also show major improvements in several real-world low data tasks (up to +4.84 F1-score). Since we evaluate the method from many perspectives (11 datasets in total), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.
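To make the terms "constructed low data regime" and "additive accuracy gain" concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that a low data regime can be built by keeping a fixed number of examples per class from a larger labelled dataset, and that an additive gain is the plain difference in percentage points between two accuracy scores. The function names `subsample_per_class` and `additive_gain`, and all numbers in the usage lines, are hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_class(examples, n_per_class, seed=0):
    """Keep at most n_per_class (text, label) pairs for every label.

    Assumption: this is one simple way to construct a low data regime;
    the abstract does not specify how the regime was built.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)  # random choice of which examples survive
        subset.extend(items[:n_per_class])
    return subset


def additive_gain(baseline_acc, augmented_acc):
    """Gain in percentage points (a difference, not a relative change)."""
    return augmented_acc - baseline_acc


# Hypothetical toy data: 20 short texts with two balanced labels.
data = [("text %d" % i, "pos" if i % 2 else "neg") for i in range(20)]
small = subsample_per_class(data, n_per_class=3)

# Hypothetical accuracies in percent, chosen only to show the arithmetic.
gain = additive_gain(baseline_acc=70.0, augmented_acc=85.5)
```

Under this reading, a reported gain such as "+15.53%" means the augmented classifier's accuracy is 15.53 percentage points higher than the baseline's, not 15.53% higher relative to it.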
