Diverse word representations have surged in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in the informal domain such as Twitter or forums, remains an ongoing challenge due to the lack of sufficient evaluation dataset. We derived a large list of variant spelling pairs from UrbanDictionary with the automatic approaches of weakly-supervised pattern-based bootstrapping and self-training linear-chain conditional random field (CRF). With these extracted relation pairs we promote the odds of eliding the text normalization procedure of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.
翻译:在大多数最先进的自然语言处理(NLP)应用中,多种语言的表达方式激增,然而,由于缺乏足够的评价数据集,如何有效评价在诸如Twitter或论坛等非正式领域嵌入的这种词,仍然是一项持续的挑战。我们从城市词典中得出了一大批不同的拼法配对清单,其自动方法是采用监督不力的模式式靴式穿梭和自我培训的线性链式随机字段。有了这些抽取的关系对,我们增加了在传统NLP管道的文本正常化程序上,在非正式领域直接采用非标准词的表达方式的可能性。我们的代码是现成的。