Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it goes beyond mere linguistic task performance. It involves determining whether these models truly understand information, perform inferences, and can draw conclusions in a logically valid way. This study compares the logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabiá - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.