Part of speech tagging for code switched data

We address the problem of part-of-speech (POS) tagging in the context of linguistic code switching (CS). CS is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential CS, respectively. Processing CS data, especially intra-sentential data, is challenging for state-of-the-art monolingual NLP technology, since such technology is geared toward processing one language at a time. In this paper we explore multiple strategies for applying state-of-the-art POS taggers to CS data. We investigate the landscape in two CS language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects. We compare the use of two POS taggers vs. a unified tagger trained on CS data. Our results show that applying a machine learning framework using two state-of-the-art POS taggers achieves better performance than all other approaches that we investigate.
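To make the "two taggers" setting concrete, here is a minimal sketch of one simple way to combine two monolingual taggers on a code-switched sentence: run both taggers on the full sentence, then keep, for each token, the tag from the tagger that matches a token-level language guess. This is an illustration only and not the paper's method; the abstract does not say how the taggers are combined, and the best-performing approach it reports is a machine learning framework, not a fixed rule. The lexicons, the `guess_language` heuristic, and all function names are hypothetical stand-ins, and the tag labels follow a Universal Dependencies style as an assumption.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]

# Hypothetical toy lexicons standing in for real monolingual POS taggers.
EN_LEXICON = {"i": "PRON", "want": "VERB", "to": "PART", "buy": "VERB", "the": "DET"}
ES_LEXICON = {"quiero": "VERB", "una": "DET", "casa": "NOUN", "grande": "ADJ"}


def make_lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    """Build a toy tagger that looks each token up and falls back to 'X'."""
    def tag(tokens: List[str]) -> List[str]:
        return [lexicon.get(tok.lower(), "X") for tok in tokens]
    return tag


def guess_language(token: str) -> str:
    """Toy language ID (assumption): Spanish if in the Spanish lexicon, else English."""
    return "es" if token.lower() in ES_LEXICON else "en"


def tag_code_switched(
    tokens: List[str], taggers: Dict[str, Tagger]
) -> List[Tuple[str, str, str]]:
    """Return (token, guessed language, tag) triples for a mixed-language sentence."""
    # Each monolingual tagger sees the whole sentence.
    outputs = {lang: tagger(tokens) for lang, tagger in taggers.items()}
    result = []
    for i, tok in enumerate(tokens):
        lang = guess_language(tok)
        # Keep the tag proposed by the tagger for the guessed language.
        result.append((tok, lang, outputs[lang][i]))
    return result


taggers = {
    "en": make_lexicon_tagger(EN_LEXICON),
    "es": make_lexicon_tagger(ES_LEXICON),
}
tagged = tag_code_switched("I want to buy una casa grande".split(), taggers)
for token, lang, pos in tagged:
    print(f"{token}\t{lang}\t{pos}")
```

A learned combiner would replace the fixed selection rule with a classifier that takes both taggers' outputs (and possibly the language guess) as features, which is closer in spirit to the machine learning framework the abstract mentions.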

Related content

Part-of-speech tagging

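Part of speech is a basic grammatical property of a word, often also called its word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word accordingly; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [1]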
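As a small illustration of tagging "based on both its definition and its context", the sketch below uses a toy lexicon (the definition) plus one context rule: an ambiguous word is tagged as a noun when it follows a determiner, and as a verb otherwise. The lexicon, the rule, and the function name `tag_sentence` are hypothetical and chosen only for this example; real taggers learn such decisions from annotated corpora.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON: Dict[str, List[str]] = {
    "i": ["PRON"],
    "the": ["DET"],
    "a": ["DET"],
    "read": ["VERB"],
    "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],  # ambiguous without context
}


def tag_sentence(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag each token using the lexicon and the previous token's tag."""
    tagged: List[Tuple[str, str]] = []
    prev_tag: Optional[str] = None
    for tok in tokens:
        candidates = LEXICON.get(tok.lower(), ["X"])  # 'X' marks unknown words
        if len(candidates) == 1:
            choice = candidates[0]
        elif prev_tag == "DET" and "NOUN" in candidates:
            choice = "NOUN"  # context rule: a determiner is followed by a noun
        elif "VERB" in candidates:
            choice = "VERB"  # otherwise prefer the verb reading
        else:
            choice = candidates[0]
        tagged.append((tok, choice))
        prev_tag = choice
    return tagged


print(tag_sentence("I read the book".split()))
print(tag_sentence("Book a flight".split()))
```

The word "book" receives a different tag in each sentence, which is the kind of context dependence the definition above refers to.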