Recently, large-scale pre-trained language models have demonstrated impressive performance on several commonsense-reasoning benchmark datasets. However, building machines with the commonsense to compose realistically plausible sentences remains challenging. In this paper, we present CommonGen, a constrained text generation task with an associated benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., {dog, frisbee, catch, throw}), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., "a man throws a frisbee and his dog catches it"). The CommonGen task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourced sentences and existing caption corpora, consists of 79k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance. Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks such as CommonsenseQA by generating additional context.