Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, on XL-Sum and experiment with multilingual and low-resource summarization tasks. XL-Sum yields competitive results compared to those obtained with similar monolingual datasets: multilingual fine-tuning achieves ROUGE-2 scores higher than 11 on the 10 languages we benchmark on, with some exceeding 15. Training on individual low-resource languages also yields competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at \url{https://github.com/csebuetnlp/xl-sum}.
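To make the reported numbers concrete, the following is a minimal, illustrative sketch of how a ROUGE-2 F1 score can be computed as bigram overlap between a reference and a candidate summary. This toy version uses naive whitespace tokenization and omits the stemming and language-specific preprocessing of the standard ROUGE toolkit, so it is not the implementation behind the paper's reported scores.

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(reference, candidate):
    """ROUGE-2 F1: harmonic mean of bigram precision and recall.

    Naive whitespace tokenization; illustrative only.
    """
    ref = bigrams(reference.split())
    cand = bigrams(candidate.split())
    overlap = sum((ref & cand).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a candidate that matches one of a reference's two bigrams with equal lengths scores 0.5, while an exact copy scores 1.0.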