Grammatical error correction (GEC) is a challenging task in natural language processing. While much work targets high-resource languages such as English and Chinese, relatively little has been done for low-resource languages, owing to the lack of large annotated corpora. For such languages, unsupervised GEC based on language-model scoring performs well, but pre-trained language models remain under-explored in this setting. This study proposes a BERT-based unsupervised GEC framework that casts GEC as a multi-class classification task. The framework comprises three modules: data-flow construction, sentence perplexity scoring, and error detection and correction. We propose a novel pseudo-perplexity scoring method to estimate how likely a sentence is to be correct, and we construct a Tagalog corpus for Tagalog GEC research. Our framework achieves competitive performance on this corpus and on an open-source Indonesian corpus, demonstrating that it complements baseline methods for the low-resource GEC task.
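The pseudo-perplexity idea mentioned above can be illustrated with a minimal sketch: mask each token in turn, sum the masked language model's log-probability of the true token, and exponentiate the negative per-token average. The real framework would query a pre-trained BERT; here `toy_logprob` is a hypothetical stand-in scorer introduced only for illustration.

```python
import math

def toy_logprob(tokens, i):
    # Hypothetical stand-in for a masked LM: returns log P(tokens[i] | context
    # with position i masked). A real system would call BERT here.
    freq = {"the": 0.5, "cat": 0.2, "sat": 0.2, "sit": 0.05}
    return math.log(freq.get(tokens[i], 0.01))

def pseudo_log_likelihood(tokens, logprob=toy_logprob):
    # Mask each position in turn and sum the log-probability of the true token.
    return sum(logprob(tokens, i) for i in range(len(tokens)))

def pseudo_perplexity(tokens, logprob=toy_logprob):
    # PPPL = exp(-PLL / N); a lower score suggests a more fluent sentence.
    return math.exp(-pseudo_log_likelihood(tokens, logprob) / len(tokens))

# The grammatical variant should score a lower pseudo-perplexity.
good = ["the", "cat", "sat"]
bad = ["the", "cat", "sit"]
assert pseudo_perplexity(good) < pseudo_perplexity(bad)
```

In an unsupervised GEC loop, candidate corrections for a detected error position would be ranked by this score, and the lowest-perplexity candidate chosen.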