Part of speech is a word's basic grammatical attribute, commonly also called word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, deciding its part of speech and marking it accordingly; it is a fundamental problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.[1]
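As a minimal illustration of what a tagger does, the sketch below runs NLTK's off-the-shelf perceptron tagger on an English sentence. The resource names assume a recent NLTK release and may differ slightly across versions; Chinese text would additionally require word segmentation and a Chinese tagging toolkit.

```python
# Minimal POS-tagging sketch with NLTK (assumes nltk is installed;
# resource names may vary slightly across NLTK versions).
import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # default English POS tagger

sentence = "Part-of-speech tagging assigns a grammatical category to each word."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)                # list of (token, Penn Treebank tag) pairs
print(tagged)
# e.g. [('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('assigns', 'VBZ'), ...]
```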


Title: Should All Cross-Lingual Embeddings Speak English?

Abstract:

Most recent work on cross-lingual word embeddings has been English-centric. The vast majority of lexicon induction evaluation dictionaries are between English and another language, and the English embedding space is chosen by default as the hub when learning in a multilingual setting. With this work, however, we challenge these practices. First, we show that the choice of hub language has a significant effect on downstream lexicon induction and zero-shot POS tagging performance. Second, we both expand a standard English-centric evaluation dictionary collection to include all language pairs using triangulation, and create new dictionaries for under-represented languages. Evaluating existing methods over all these language pairs sheds light on their suitability for aligning embeddings from distant languages and presents new challenges for the field. Finally, in our analysis we identify general guidelines for strong cross-lingual embedding baselines that extend to language pairs not involving English.
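As a rough sketch of the triangulation idea mentioned in the abstract (illustrative only, not the authors' code), the following pivots two English-centric dictionaries through their shared English entries to induce translation pairs for a new language pair:

```python
# Hypothetical dictionary triangulation through English as a pivot:
# join EN->L1 and EN->L2 word pairs on the shared English word to get L1->L2 pairs.
from collections import defaultdict

def triangulate(en_to_l1, en_to_l2):
    """Build an L1->L2 dictionary from two English-centric dictionaries.

    en_to_l1, en_to_l2: iterables of (english_word, translation) pairs.
    Returns a set of (l1_word, l2_word) pairs sharing an English pivot.
    """
    by_en = defaultdict(set)
    for en, l1 in en_to_l1:
        by_en[en].add(l1)
    pairs = set()
    for en, l2 in en_to_l2:
        for l1 in by_en.get(en, ()):
            pairs.add((l1, l2))
    return pairs

# Toy example with made-up entries:
en_de = [("dog", "Hund"), ("house", "Haus")]
en_es = [("dog", "perro"), ("house", "casa")]
print(triangulate(en_de, en_es))  # {('Hund', 'perro'), ('Haus', 'casa')} (order may vary)
```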


Latest Content

Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by failure to capture authorship style, by the topic shift or by other factors. Motivated by this, we propose the \emph{topic confusion} task, where we switch the author-topic configuration between training and testing set. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the models' confusion by the switch because the features capture the topics, and one caused by the features' inability to capture the writing styles, leading to weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level $n$-grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task, and are outperformed by simple $n$-gram features.
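As a hedged sketch of the kind of feature combination this abstract describes (not the paper's exact configuration), the snippet below joins word-level n-grams with POS-tag n-grams in a scikit-learn pipeline; the vectorizer settings and classifier are illustrative assumptions, and it presumes the NLTK resources from the earlier snippet are available.

```python
# Illustrative combination of word n-grams and POS-tag n-grams for
# authorship attribution; settings are assumptions, not the paper's setup.
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

def to_pos_sequences(texts):
    """Replace each document with its sequence of POS tags."""
    return [" ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(t)))
            for t in texts]

features = make_union(
    # word-level unigrams and bigrams over the raw text
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    # POS-tag bigrams and trigrams over the tag sequence
    make_pipeline(FunctionTransformer(to_pos_sequences),
                  CountVectorizer(analyzer="word", ngram_range=(2, 3))),
)

model = make_pipeline(features, LogisticRegression(max_iter=1000))
# Usage: model.fit(train_texts, train_authors); model.predict(test_texts)
```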
