"A Passage to India": Pre-trained Word Embeddings for Indian Languages

Dense word vectors, or 'word embeddings', which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place these embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for resource-constrained Indian language NLP. The title of this paper refers to the famous novel 'A Passage to India' by E.M. Forster, published initially in 1924.
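To give a sense of what "all pairs of the aforementioned languages" means in scale, the short sketch below enumerates the language pairs. It assumes the pairs are unordered (one model per pair regardless of direction), which the abstract does not state; the variable names are illustrative only and are not taken from the released repository.

```python
from itertools import combinations

# The 14 languages, spelled as in the abstract.
LANGUAGES = [
    "Assamese", "Bengali", "Gujarati", "Hindi", "Kannada", "Konkani",
    "Malayalam", "Marathi", "Nepali", "Odiya", "Punjabi", "Sanskrit",
    "Tamil", "Telugu",
]

# Assumption: pairs are unordered, so (Hindi, Marathi) and
# (Marathi, Hindi) count as the same pair.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))  # 14
print(len(pairs))      # 91, i.e. 14 * 13 / 2
```

If the pairs were instead treated as ordered (source language, target language), the count would double to 182.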


Related content

Word vector representation

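A distributed representation expresses language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: larger windows tend to yield embeddings that reflect topical information, whereas smaller windows tend to yield embeddings that reflect a word's function and its contextual semantics.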
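The two ideas above, analogy arithmetic and the context window, can be illustrated with a minimal sketch. The vectors here are hand-picked toy values, not embeddings from any trained or released model. The helper names `analogy` and `context_pairs` are hypothetical. The pair extraction follows the general skip-gram style of windowing and is not any specific library's implementation.

```python
import math

# Toy 3-dimensional vectors chosen by hand for illustration only.
# The dimensions loosely stand for (royalty, maleness, femaleness).
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def context_pairs(tokens, window):
    """Collect (center, context) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


print(analogy("man", "woman", "king", TOY_VECTORS))  # queen

sentence = ["dense", "vectors", "encode", "meaning"]
print(len(context_pairs(sentence, window=1)))  # 6
print(len(context_pairs(sentence, window=2)))  # 10
```

Widening the window from 1 to 2 adds more distant words as context for each center word, which is the mechanism behind the topical versus functional trade-off described above.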
