Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. In this work, we show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector. We evaluate our approach on GEC datasets across multiple domains (CoNLL-2014, BEA-2019, GMEG-wiki and GMEG-yahoo) and show that it outperforms existing methods in both the unsupervised setting (+7.7 F0.5) and the supervised setting (+0.5 F0.5).
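The LM-Critic decision rule described above can be sketched in a few lines. The sketch below is a toy illustration, not the paper's implementation: `toy_log_prob` is a hypothetical stand-in for a pretrained LM (scoring sentences by counting known-good word bigrams), and the perturbation set is limited to single word deletions and adjacent swaps.

```python
# Toy sketch of the LM-Critic rule: a sentence is judged grammatical iff the
# LM assigns it a higher score than every one of its local perturbations.

# Hypothetical "known-good" bigrams standing in for LM knowledge.
GOOD_BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
                ("on", "the"), ("the", "mat")}

def toy_log_prob(sentence):
    """Stand-in LM score: +1 per known-good bigram, -1 per unknown bigram."""
    words = sentence.split()
    return sum(1 if b in GOOD_BIGRAMS else -1 for b in zip(words, words[1:]))

def local_perturbations(sentence):
    """Word-level neighborhood: single deletions and adjacent swaps."""
    words = sentence.split()
    neighbors = set()
    for i in range(len(words)):                       # delete word i
        neighbors.add(" ".join(words[:i] + words[i + 1:]))
    for i in range(len(words) - 1):                   # swap words i, i+1
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.add(" ".join(swapped))
    neighbors.discard(sentence)                       # exclude the input itself
    return neighbors

def lm_critic(sentence, log_prob):
    """Grammatical iff the sentence outscores every local perturbation."""
    score = log_prob(sentence)
    return all(score > log_prob(p) for p in local_perturbations(sentence))
```

With this toy scorer, `lm_critic("the cat sat on the mat", toy_log_prob)` returns `True`, while the locally perturbed `"the cat sat the on mat"` is rejected, since swapping the words back yields a higher-scoring neighbor.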