Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This creates a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches such as Cross-Lingual Text Classification (CLTC), the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to this emerging AI divide. Recent work on CLTC has focused on demonstrating the benefits of using bilingual word embeddings as features, relegating the CLTC problem itself to a mere benchmark based on a simple averaged perceptron. In this paper, we explore two flavors of the CLTC problem more extensively and systematically: news topic classification and textual churn intent detection (TCID) in social media. In particular, we test the hypothesis that embeddings learned with task context are more effective, by multi-tasking the learning of multilingual word embeddings and text classification; we explore neural architectures for CLTC; and we move from bilingual to multilingual word embeddings. Across all architectures, types of word embeddings, and datasets, we observe a consistent gain in favor of multilingual joint training, especially for low-resource languages.
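To make the multi-tasking idea concrete, the following is a minimal sketch (not the paper's implementation): a shared multilingual embedding table is averaged into a sentence vector (the averaged-perceptron-style representation mentioned above) and feeds two task-specific softmax heads, one per flavor of CLTC. Alternating SGD steps between tasks lets both tasks' gradients update the shared embeddings. All names, dimensions, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 50, 8          # toy multilingual vocabulary and embedding size
N_TOPICS, N_CHURN = 3, 2    # news-topic classes and churn/no-churn classes

E = rng.normal(0, 0.1, (VOCAB, DIM))          # shared embedding table
W = {"topic": rng.normal(0, 0.1, (DIM, N_TOPICS)),
     "churn": rng.normal(0, 0.1, (DIM, N_CHURN))}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(tokens, task):
    h = E[tokens].mean(axis=0)                # averaged-embedding sentence vector
    return h, softmax(h @ W[task])

def train_step(tokens, label, task, lr=0.3):
    """One SGD step on cross-entropy; updates the task head AND the shared E."""
    h, p = forward(tokens, task)
    g = p.copy()
    g[label] -= 1.0                           # d(cross-entropy)/d(logits)
    gh = W[task] @ g                          # gradient w.r.t. the sentence vector
    W[task] -= lr * np.outer(h, g)
    E[tokens] -= lr * gh / len(tokens)        # shared embeddings see both tasks

# Toy sentences as token-id lists; alternate tasks so E is jointly trained.
data = [([1, 2, 3], 0, "topic"), ([4, 5], 1, "churn"),
        ([1, 3, 6], 2, "topic"), ([4, 7], 0, "churn")]
for _ in range(500):
    for tokens, label, task in data:
        train_step(tokens, label, task)
```

Because the embedding table is shared, a gradient step on the churn task also moves the representations used by the topic task; this parameter sharing is what the hypothesized context-aware multi-task training exploits.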