Multilingual Open Text Release 1: Public Domain News in 44 Languages


June 9, 2022


Chester Palen-Michel, June Kim, Constantine Lignos

From arXiv; submitted to LREC 2022

We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001 and 2022 and collected from Voice of America's news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.

