We present COUGH, a large and challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, COUGH consists of three parts: an FAQ Bank, a Query Bank, and a Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce the Query Bank and the Relevance Set: the former contains 1,236 human-paraphrased queries, and the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing several FAQ retrieval models built on top of BM25 and BERT; the best model achieves a P@5 of 48.8, which indicates that COUGH poses a considerable challenge and encourages future research. The COUGH dataset is available at https://github.com/sunlab-osu/covid-faq.
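To make the reported metric concrete, the following is a minimal sketch of how P@5 could be computed from a ranked list of FAQ items and a set of annotated relevant items. It assumes binary relevance, a score averaged over queries, and reporting on a 0-100 scale; the abstract does not spell out these details, and the function names and data layout (`precision_at_k`, `mean_precision_at_k`, dictionaries keyed by query ID) are hypothetical rather than taken from the COUGH repository.

```python
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved FAQ items that are annotated as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for faq_id in top_k if faq_id in relevant_ids)
    # Divide by k (not by len(top_k)), so short result lists are penalized.
    return hits / k


def mean_precision_at_k(
    rankings: Dict[str, List[str]],
    relevance: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Average P@k over all queries, scaled to 0-100 (an assumed convention)."""
    if not rankings:
        raise ValueError("rankings must contain at least one query")
    scores = [
        precision_at_k(ranked, relevance.get(query_id, set()), k)
        for query_id, ranked in rankings.items()
    ]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: two queries, each with a ranked list of hypothetical FAQ IDs.
    rankings = {
        "q1": ["f1", "f2", "f3", "f4", "f5"],
        "q2": ["f9", "f8", "f7", "f6", "f5"],
    }
    relevance = {"q1": {"f1", "f3", "f10"}, "q2": {"f9"}}
    print(f"P@5 = {mean_precision_at_k(rankings, relevance):.1f}")
```

Under these assumptions, a score of 48.8 would mean that, on average, roughly half of the top five retrieved FAQ items for a query are judged relevant.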