People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions than their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs. passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all, LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
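The minimal-pair test described above can be sketched as a simple comparison: for each (plausible, implausible) pair, check whether the model assigns the plausible sentence a higher score. In the sketch below, the toy scores are invented placeholders standing in for an LLM's summed token log-probabilities; a real study would obtain them from a language model, with minimal pairs controlling for length and vocabulary.

```python
def minimal_pair_accuracy(log_prob, pairs):
    """Fraction of (plausible, implausible) pairs for which the
    scorer assigns the plausible sentence a higher score."""
    hits = sum(log_prob(p) > log_prob(q) for p, q in pairs)
    return hits / len(pairs)

# Stand-in scorer: hypothetical sentence-level log-probabilities.
# (Invented numbers for illustration only, not model outputs.)
toy_scores = {
    "The teacher bought the laptop.": -20.1,
    "The laptop bought the teacher.": -28.7,  # impossible event
    "The nanny tutored the boy.": -22.3,
    "The boy tutored the nanny.": -21.9,      # unlikely, not impossible
}
pairs = [
    ("The teacher bought the laptop.", "The laptop bought the teacher."),
    ("The nanny tutored the boy.", "The boy tutored the nanny."),
]
acc = minimal_pair_accuracy(toy_scores.get, pairs)
# Here the scorer gets the possible/impossible pair right but the
# likely/unlikely pair wrong, mirroring the gap reported above.
```

With these placeholder scores, accuracy is 0.5: the impossible event is correctly dispreferred, while the merely unlikely event is scored slightly higher than its likely counterpart.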