Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello

From arXiv. Accepted at the Language Resources and Evaluation Conference 2022 (LREC 2022).

Multi-modal Machine Translation (MMT) uses visual information to improve translation quality. The visual information serves as valuable context that reduces the ambiguity of input sentences. Despite the increasing popularity of this technique, good and sizeable datasets are scarce, which limits its potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. An estimated 100 to 150 million people speak the language, more than 80 million of them as indigenous speakers. This is more than for any other Chadic language. Despite its large number of speakers, Hausa is considered low-resource in natural language processing (NLP) because it lacks sufficient resources to implement most NLP tasks. The datasets that do exist are few, machine-generated, or restricted to the religious domain. There is therefore a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image, or of a section within the image, in Hausa and its equivalent in English. To prepare the dataset, we first automatically translated the English image descriptions in the Hindi Visual Genome (HVG) into Hausa. The synthetic Hausa data was then carefully post-edited with reference to the respective images. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
