Machine Reading Comprehension (MRC) has become one of the essential tasks in Natural Language Understanding (NLU), as it is included in several NLU benchmarks (Liang et al., 2020; Wilie et al., 2020). However, most MRC datasets contain only answerable questions, overlooking the importance of unanswerable questions. MRC models trained only on answerable questions will select the span that is most likely to be the answer, even when the answer does not actually exist in the given passage (Rajpurkar et al., 2018). This problem is especially persistent in medium- to low-resource languages like Indonesian. Existing Indonesian MRC datasets (Purwarianti et al., 2007; Clark et al., 2020) are still inadequate because of their small size and limited question types, i.e., they only cover answerable questions. To fill this gap, we build a new Indonesian MRC dataset called I(n)don'tKnow-MRC (IDK-MRC) by combining automatic and manual unanswerable question generation to minimize the cost of manual dataset construction while maintaining dataset quality. Combined with the existing answerable questions, IDK-MRC consists of more than 10K questions in total. Our analysis shows that our dataset significantly improves the performance of Indonesian MRC models, with a large improvement on unanswerable questions.