Although researchers and practitioners are pushing the boundaries and enhancing the capacities of NLP tools and methods, work on African languages is lagging. Much of the focus is on well-resourced languages such as English, Japanese, German, French, Russian, and Mandarin Chinese. Over 97% of the world's 7000 languages, including African languages, are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP research. For instance, only 5 out of 2965 (0.19%) authors of full-text papers in the ACL Anthology, extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP, COLING, and CoNLL), are affiliated with African institutions. In this work, we discuss our effort toward building a standard machine translation benchmark dataset for Igbo, one of the 3 major Nigerian languages. Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria. Igbo is low-resourced, although there have been some efforts toward developing IgboNLP, such as part-of-speech tagging and diacritic restoration.