uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers

The Chinese Spelling Check (CSC) task aims to detect and correct spelling errors in text. Because manually annotating a high-quality dataset is expensive and time-consuming, training sets are usually very small (e.g., SIGHAN15 contains only 2339 training samples). Supervised models therefore tend to suffer from data sparsity and over-fitting, especially in the era of large language models. In this paper, we investigate the unsupervised paradigm for CSC and propose a framework named uChecker that performs unsupervised spelling error detection and correction. Masked pretrained language models such as BERT serve as the backbone, given their strong language diagnosis capability. Taking advantage of the variety and flexibility of masking operations, we propose a confusion-set-guided masking strategy to fine-train the masked language model and further improve unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of uChecker in terms of character-level and sentence-level accuracy, precision, recall, and F1 on spelling error detection and correction.
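The abstract does not spell out how the confusion-set-guided masking works, so the following is only a minimal sketch of one plausible reading: when corrupting text for masked-LM training, a selected character is sometimes replaced by a confusable character instead of `[MASK]`, and the model is trained to recover the original. The function name `confusion_guided_mask`, the parameters `rate` and `p_confuse`, and the toy `CONFUSION_SET` entries are all assumptions made for illustration, not details taken from the paper.

```python
import random

MASK = "[MASK]"

# Toy confusion set (hypothetical entries): each character maps to
# characters that are easily confused with it by sound or shape.
CONFUSION_SET = {
    "在": ["再"],
    "再": ["在"],
    "的": ["地", "得"],
    "地": ["的", "得"],
    "得": ["的", "地"],
}


def confusion_guided_mask(chars, confusion_set, rate=0.15, p_confuse=0.5, rng=None):
    """Corrupt a character sequence for masked-LM training (illustrative only).

    Each position is selected with probability `rate`. A selected character
    that has confusion-set entries is replaced by one of them with
    probability `p_confuse`; otherwise it is replaced by [MASK].

    Returns (corrupted, labels), where labels[i] is the original character
    at selected positions and None elsewhere.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for ch in chars:
        if rng.random() >= rate:
            # Position not selected: keep the character, no training label.
            corrupted.append(ch)
            labels.append(None)
            continue
        candidates = confusion_set.get(ch, [])
        if candidates and rng.random() < p_confuse:
            # Simulate a spelling error with a confusable character.
            corrupted.append(rng.choice(candidates))
        else:
            # Fall back to the ordinary mask token.
            corrupted.append(MASK)
        labels.append(ch)
    return corrupted, labels


if __name__ == "__main__":
    sentence = list("我在家的时候")
    noisy, targets = confusion_guided_mask(
        sentence, CONFUSION_SET, rate=0.5, rng=random.Random(0)
    )
    print("".join(noisy))
    print(targets)
```

In this sketch the labels are kept only at corrupted positions, mirroring the usual masked-LM objective; whether uChecker computes its loss this way is not stated in the abstract.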
