Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete, variable-length label subset for each document, because they select the positive labels relevant to the document either by a fixed threshold or by taking the top k labels in descending order of score. A less popular type of deep learning model, sequence-to-sequence (Seq2Seq), focuses on predicting variable-length positive labels as a sequence. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, so the default label order constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose OTSeq2Set, an autoregressive sequence-to-set model for XMTC tasks. Our model generates predictions in a student-forcing scheme and is trained with a loss function based on bipartite matching, which enables permutation invariance. In addition, we use the optimal transport distance as a measure to force the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
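To illustrate what a permutation-invariant loss based on bipartite matching means, here is a minimal toy sketch. It is not the OTSeq2Set implementation: the function name `matching_loss`, the input layout (one probability list per decoding step), and the brute-force search over permutations are all assumptions made for illustration. For realistic label counts, a polynomial-time assignment solver such as the Hungarian algorithm would be the usual choice, and the optimal transport term mentioned in the abstract is not modeled here.

```python
import math
from itertools import permutations


def matching_loss(step_probs, target_labels):
    """Toy bipartite-matching loss (hypothetical helper, not the paper's code).

    step_probs[t][label] is the predicted probability of `label` at decoding
    step t. The loss is the smallest total negative log-likelihood over all
    ways of assigning the target labels to the first len(target_labels) steps.
    """
    n = len(target_labels)
    if n > len(step_probs):
        raise ValueError("need at least as many decoding steps as target labels")

    best_cost, best_perm = math.inf, None
    # Brute force: try every ordering of the target set (fine for tiny sets only).
    for perm in permutations(target_labels):
        cost = -sum(math.log(step_probs[t][label]) for t, label in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Two decoding steps over a 3-label vocabulary; the target set is {0, 1}.
probs = [
    [0.1, 0.7, 0.2],  # step 0 prefers label 1
    [0.6, 0.1, 0.3],  # step 1 prefers label 0
]
cost_a, order_a = matching_loss(probs, [0, 1])
cost_b, order_b = matching_loss(probs, [1, 0])
print(order_a, order_b)  # both calls pick the same best ordering
```

Because the loss takes the minimum over all assignments, listing the target labels as `[0, 1]` or `[1, 0]` yields the same value, which is the sense in which the default label order no longer constrains training.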