The ever-growing volume of user-generated content on social media provides a nearly unlimited corpus of unlabeled data, even in languages where resources are scarce. In this paper, we demonstrate that state-of-the-art results on two Thai social text categorization tasks can be achieved by pretraining a language model on a large, noisy Thai social media corpus of over 1.26 billion tokens and later fine-tuning it on the downstream classification tasks. Because of the linguistically noisy and domain-specific nature of the content, we apply data preprocessing steps designed specifically for Thai social media to make the text easier for the models to learn from. We compare four modern language models: ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT. We systematically compare the models across several dimensions, including speed of pretraining and fine-tuning, perplexity, downstream classification benchmarks, and performance under limited pretraining data.