We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, widening the attribute range, and introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.