We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural or likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
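To make the robustness evaluation concrete, here is a minimal sketch of cluster-level worst-case scoring. The function name, the per-question score format, and the use of minimum over a cluster are assumptions for illustration, not the benchmark's official implementation.

```python
from collections import defaultdict

# Sketch (assumption): each question belongs to a cluster of related constraint
# variations, and a model's robustness score for a cluster is its worst
# per-question score within that cluster.

def worst_case_cluster_scores(per_question_scores, cluster_ids):
    """per_question_scores: dict question_id -> score (e.g. F1 over the answer set).
    cluster_ids: dict question_id -> cluster_id.
    Returns dict cluster_id -> worst (minimum) score within that cluster."""
    by_cluster = defaultdict(list)
    for qid, score in per_question_scores.items():
        by_cluster[cluster_ids[qid]].append(score)
    return {cid: min(scores) for cid, scores in by_cluster.items()}

# Example: a model that handles two of three related constraint variations well
# is still penalized by the weakest variation.
scores = {"q1": 0.9, "q2": 0.85, "q3": 0.2}
clusters = {"q1": "c1", "q2": "c1", "q3": "c1"}
print(worst_case_cluster_scores(scores, clusters))  # {'c1': 0.2}
```

This worst-case aggregation is what distinguishes the robustness measure from average accuracy: a model only scores well on a cluster if it handles every constraint variation in it.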