Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. To increase classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle multilabel PLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system in which all documents, irrespective of language, are classified by the same (2nd-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by the 1st-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available polylingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
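The two-tier architecture described in the abstract can be sketched in a few lines of scikit-learn. The sketch below is illustrative only: it uses synthetic data, single-label classification for brevity (the paper addresses the multilabel case), and fits the 2nd-tier classifier on training-set posteriors directly, whereas a faithful implementation would obtain well-calibrated posteriors (e.g. via cross-validation) to avoid overfitting. The names `first_tier` and `meta_clf` are ours, not from the paper.

```python
# Minimal sketch of a funnelling-style two-tier ensemble.
# Assumptions: synthetic data, single-label setting, uncalibrated posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each language has its own, incompatible feature space (different dimensionality).
langs = {"en": 50, "it": 40, "de": 60}
n_classes = 3

first_tier = {}          # one language-dependent classifier per language
meta_X, meta_y = [], []  # training data for the shared 2nd-tier classifier

for lang, dim in langs.items():
    X = rng.normal(size=(120, dim))
    y = rng.integers(0, n_classes, size=120)
    clf = LogisticRegression(max_iter=1000).fit(X, y)  # 1st-tier classifier
    first_tier[lang] = clf
    # Posterior probabilities form the common, language-independent space:
    # every document, whatever its language, becomes an n_classes-dim vector.
    meta_X.append(clf.predict_proba(X))
    meta_y.append(y)

# 2nd-tier classifier trained on ALL documents, irrespective of language.
meta_clf = LogisticRegression(max_iter=1000).fit(
    np.vstack(meta_X), np.concatenate(meta_y)
)

def classify(lang, x):
    """Funnel a test document through its language's 1st-tier classifier,
    then classify its posterior vector with the shared 2nd-tier classifier."""
    probs = first_tier[lang].predict_proba(x.reshape(1, -1))
    return meta_clf.predict(probs)[0]

pred = classify("it", rng.normal(size=40))
```

The key point the sketch makes concrete is that the 2nd-tier classifier never sees language-specific features: its input dimensionality equals the number of classes, so training documents from every language contribute to a single decision function.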