The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT

This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, along with tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages. Using the package, it is possible to work on realistic low-resource scenarios, avoiding the artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages with systematic language and script annotation and data splits to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.

