Language evolves gradually. Grammar, vocabulary, and lexical semantics shift over time, producing a diachronic linguistic gap. As a result, a considerable number of texts are written in the language of different eras, which creates obstacles for natural language processing tasks such as word segmentation and machine translation. Although the Chinese language has a long history, previous Chinese natural language processing research has primarily focused on tasks within a specific era. We therefore propose CROSSWISE, a cross-era learning framework for Chinese word segmentation (CWS) that uses a Switch-memory (SM) module to incorporate era-specific linguistic knowledge. Experiments on four corpora from different eras show that performance on each corpus improves significantly. Further analyses also demonstrate that the SM module can effectively integrate era-specific knowledge into the neural network.