To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Gözde Gül Şahin

From arXiv; accepted to Computational Linguistics

Data-hungry deep neural networks have established themselves as the standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One way to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently seen a large number of text augmentation techniques, the field still lacks a systematic performance analysis across a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methods, which make changes at the syntactic (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare them on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families, using various models, including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation improves dependency parsing most significantly, followed by part-of-speech tagging and semantic role labeling. We find the tested techniques to be generally more effective on morphologically rich languages than on analytic languages such as Vietnamese. Our results suggest that augmentation techniques can further improve over strong mBERT-based baselines. We identify character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss how the results depend most heavily on the task, the language pair, and the model type.
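As a rough illustration of two of the augmentation levels named in the abstract, the sketch below shows one possible character-level swap and one possible token-level random insertion. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the choice to swap only adjacent interior characters, and the per-token probability `prob` are all hypothetical. Note that the character-level variant keeps the number of tokens unchanged, so per-token tags stay aligned, whereas token insertion would also require a label for the new token, which this sketch does not handle.

```python
import random


def swap_characters(token, rng):
    """Swap two adjacent interior characters; short tokens are returned unchanged.

    Keeping the first and last character fixed is an assumption of this sketch.
    """
    if len(token) < 4:
        return token
    i = rng.randrange(1, len(token) - 2)  # i and i + 1 are both interior positions
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_char_level(tokens, prob, rng):
    """Apply swap_characters to each token independently with probability prob."""
    return [swap_characters(t, rng) if rng.random() < prob else t for t in tokens]


def insert_random_word(tokens, vocabulary, rng):
    """Insert one word drawn uniformly from vocabulary at a random position."""
    position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + [rng.choice(vocabulary)] + tokens[position:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same augmentations
    sentence = ["augmentation", "helps", "dependency", "parsing"]
    print(augment_char_level(sentence, prob=0.5, rng=rng))
    print(insert_random_word(sentence, ["often", "rarely"], rng))
```

Passing an explicit `random.Random` instance rather than using the module-level generator keeps the augmentation reproducible, which matters when comparing techniques across runs.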
