Lexical normalization research addresses the challenge of processing informal expressions in user-generated text, yet the lack of comprehensive evaluations leaves it unclear which methods perform best across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) we construct a large-scale, multi-domain Japanese normalization dataset; (2) we develop normalization methods based on state-of-the-art pretrained models; and (3) we conduct experiments from multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in terms of both accuracy and efficiency.