Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, on XL-Sum and experiment with multilingual and low-resource summarization tasks. XL-Sum yields competitive results compared to those obtained with similar monolingual datasets: multilingual fine-tuning achieves ROUGE-2 scores higher than 11 on the 10 languages we benchmark on, with some exceeding 15. Training on individual low-resource languages also yields competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at \url{https://github.com/csebuetnlp/xl-sum}.
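To make the reported numbers concrete, the following is a minimal, illustrative sketch of how a ROUGE-2 F1 score can be computed as bigram overlap between a reference and a candidate summary. This toy version uses naive whitespace tokenization and omits the stemming and language-specific preprocessing of the standard ROUGE toolkit, so it is not the implementation behind the paper's reported scores.

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(reference, candidate):
    """ROUGE-2 F1: harmonic mean of bigram precision and recall.

    Naive whitespace tokenization; illustrative only.
    """
    ref = bigrams(reference.split())
    cand = bigrams(candidate.split())
    overlap = sum((ref & cand).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a candidate that matches one of a reference's two bigrams with equal lengths scores 0.5, while an exact copy scores 1.0.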