This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, along with tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with much broader coverage of the world's languages. Using the package, it is possible to work on realistic low-resource scenarios, avoiding the artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages, with systematic language and script annotation and data splits, to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.