Language model probing is often used to test specific capabilities of models. However, conclusions from such studies may be limited when the probing benchmarks are small and lack statistical power. In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500) inspired by psycholinguistic studies. We dramatically extend the existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create a second extended negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it consists of 770 sentence pairs. We evaluate 22 models on the extended datasets and find that model performance drops by 20-57% compared to the original smaller benchmarks. We observe high levels of negation sensitivity in models like BERT and ALBERT, demonstrating that previous findings may have been skewed by the smaller test sets. Finally, we observe that although GPT3 generated all the examples in ROLE-1500, it is able to solve only 24.6% of them during probing.
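To make the template-based construction concrete, the following is a minimal sketch of how negation sentence pairs might be generated from a fixed template, in the spirit of NEG-1500-SIMP-TEMP. The template wording and the (subject, category) exemplars here are illustrative assumptions, not the paper's actual templates or category norms.

```python
# Illustrative template-based negation-pair generation.
# Templates and exemplars are hypothetical stand-ins for the
# resources used to build NEG-1500-SIMP-TEMP.

AFFIRMATIVE = "A {subject} is a {category}."
NEGATED = "A {subject} is not a {category}."

# Hypothetical (subject, true-category) exemplars.
EXEMPLARS = [
    ("robin", "bird"),
    ("salmon", "fish"),
    ("hammer", "tool"),
]

def make_pairs(exemplars):
    """Return (affirmative, negated) sentence pairs filled from templates."""
    pairs = []
    for subject, category in exemplars:
        pairs.append((
            AFFIRMATIVE.format(subject=subject, category=category),
            NEGATED.format(subject=subject, category=category),
        ))
    return pairs

for affirmative, negated in make_pairs(EXEMPLARS):
    print(affirmative, "|", negated)
```

In a probing setup, the final content word of each sentence would be masked and a model scored on whether its prediction is consistent with the (affirmative or negated) context.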