Multilingual evaluation benchmarks usually cover a limited set of high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generating Multilingual CheckLists. We devise an algorithm, the Template Extraction Algorithm (TEA), for automatically extracting target-language CheckList templates from machine-translated instances of source-language templates. We compare TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze the different approaches to creating CheckLists in Hindi. Furthermore, we extend our experiments to nine additional languages. We find that TEA followed by human verification is ideal for scaling CheckList-based evaluation to multiple languages, while TEA on its own gives a good estimate of model performance.
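To make the core idea behind template extraction concrete, here is a minimal, illustrative sketch, not the paper's exact TEA implementation: assuming the machine-translated instances of one source template differ only in a single contiguous slot, the shared skeleton can be recovered as the longest common prefix and suffix across instances, with the varying span becoming the slot fillers. The function name `extract_template` and the Hindi (romanized) example sentences are hypothetical.

```python
# Illustrative sketch of template extraction (assumes one contiguous
# varying span per instance; the full algorithm handles more general
# alignments). Not the authors' reference implementation.

from typing import List, Tuple

def extract_template(instances: List[List[str]]) -> Tuple[List[str], List[List[str]]]:
    """Return (template tokens with a '{mask}' slot, slot fillers per instance)."""
    # Longest common token prefix across all instances.
    prefix = 0
    while all(len(inst) > prefix for inst in instances) and \
          len({inst[prefix] for inst in instances}) == 1:
        prefix += 1
    # Longest common token suffix, not overlapping the prefix.
    suffix = 0
    while all(len(inst) - suffix > prefix for inst in instances) and \
          len({inst[-suffix - 1] for inst in instances}) == 1:
        suffix += 1
    skeleton = instances[0][:prefix] + ["{mask}"] + \
               instances[0][len(instances[0]) - suffix:]
    fillers = [inst[prefix:len(inst) - suffix] for inst in instances]
    return skeleton, fillers

# Hypothetical example: translated instances differing only in the adjective.
translated = [
    "yah film acchi thi".split(),
    "yah film kharab thi".split(),
    "yah film shaandaar thi".split(),
]
template, values = extract_template(translated)
print(" ".join(template))  # yah film {mask} thi
print(values)              # [['acchi'], ['kharab'], ['shaandaar']]
```

Recovering the skeleton this way is what lets a target-language CheckList template be built without a native speaker authoring it from scratch; human verification can then be layered on top to catch translation errors.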