Compositional generalization benchmarks seek to assess whether models can accurately compute meanings for novel sentences, but operationalize this in terms of logical form (LF) prediction. This raises the concern that semantically irrelevant details of the chosen LFs could shape model performance. We argue that this concern is realized for the COGS benchmark (Kim and Linzen, 2020). COGS poses generalization splits that appear impossible for present-day models, which could be taken as an indictment of those models. However, we show that the negative results trace to incidental features of COGS LFs. Converting these LFs to semantically equivalent ones and factoring out capabilities unrelated to semantic interpretation, we find that even baseline models get traction. A recent variable-free translation of COGS LFs suggests similar conclusions, but we observe this format is not semantically equivalent; it is incapable of accurately representing some COGS meanings. These findings inform our proposal for ReCOGS, a modified version of COGS that comes closer to assessing the target semantic capabilities while remaining very challenging. Overall, our results reaffirm the importance of compositional generalization and careful benchmark task design.
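To make the notion of "semantically equivalent" LFs concrete, here is a minimal illustrative sketch (not the paper's actual conversion procedure): two variable-based logical forms can denote the same meaning even when their variable indices differ, i.e. when they are alpha-equivalent. The LF strings below are simplified, hypothetical examples in a COGS-like conjunctive style; real COGS LFs include additional machinery (e.g. `*`-prefixed definite terms) that this sketch ignores.

```python
from itertools import permutations

def parse_lf(lf):
    """Split a simplified COGS-style LF into (predicate, args) conjuncts,
    e.g. 'rose(x_1) AND help.theme(x_3, x_1)'
      -> [('rose', ('x_1',)), ('help.theme', ('x_3', 'x_1'))]."""
    conjuncts = []
    for part in lf.split(" AND "):
        pred, args = part.strip().rstrip(")").split("(")
        conjuncts.append((pred, tuple(a.strip() for a in args.split(","))))
    return conjuncts

def alpha_equivalent(lf_a, lf_b):
    """Return True if the two LFs are identical up to a consistent
    renaming (bijection) of variables. Brute force over bijections:
    fine for short sentences, exponential in the number of variables."""
    a, b = parse_lf(lf_a), parse_lf(lf_b)
    if len(a) != len(b):
        return False
    vars_a = sorted({v for _, args in a for v in args})
    vars_b = sorted({v for _, args in b for v in args})
    if len(vars_a) != len(vars_b):
        return False
    sorted_b = sorted(b)
    for perm in permutations(vars_b):
        mapping = dict(zip(vars_a, perm))
        renamed = sorted((p, tuple(mapping[v] for v in args)) for p, args in a)
        if renamed == sorted_b:
            return True
    return False

# Same meaning, different variable indices -> equivalent.
lf1 = "rose(x_1) AND help.theme(x_3, x_1)"
lf2 = "rose(x_7) AND help.theme(x_2, x_7)"
print(alpha_equivalent(lf1, lf2))  # True
```

Under an equivalence like this, exact-string accuracy on LF prediction can penalize outputs whose only "error" is an index choice, which is one kind of semantically irrelevant detail the abstract refers to.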