In this paper, we introduce a systematic framework that goes beyond conventional static benchmarking to assess LLMs' mathematical-reasoning robustness, stress-testing them on advanced math problems that are mathematically equivalent but linguistically and parametrically varied. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this methodology, we create PutnamGAP, a new benchmark dataset containing multiple mathematically equivalent variants of competition-level math problems. With this dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the original problems but drops by 4.7 percentage points on surface-renaming variants and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of LLMs' robustness and yields new insights for further improving their mathematical reasoning capabilities.
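The headline numbers above are accuracy gaps measured in percentage points between original problems and their variants. The following is a minimal sketch of how such gaps can be computed, not the paper's actual evaluation harness; the record format, field names, and the toy data (seeded to roughly mirror the reported O3 figures) are assumptions introduced for illustration.

```python
import random
from collections import defaultdict


def accuracy(records):
    """Fraction of records judged correct, expressed as a percentage."""
    return 100.0 * sum(r["correct"] for r in records) / len(records)


def robustness_gaps(records):
    """Group per-problem results by variant type and report each variant's
    percentage-point drop relative to the original problems.

    Each record is assumed to look like:
        {"variant": "original" | "surface" | "parametric", "correct": bool}
    """
    by_variant = defaultdict(list)
    for r in records:
        by_variant[r["variant"]].append(r)

    base = accuracy(by_variant["original"])
    gaps = {
        variant: base - accuracy(recs)  # positive value = degradation
        for variant, recs in by_variant.items()
        if variant != "original"
    }
    return base, gaps


if __name__ == "__main__":
    # Toy data: per-variant correctness rates chosen to roughly match the
    # reported O3 numbers (51.5% on originals, drops of ~4.7 and ~12.9 points).
    random.seed(0)
    rates = {"original": 0.515, "surface": 0.468, "parametric": 0.386}
    toy = [
        {"variant": v, "correct": random.random() < p}
        for v, p in rates.items()
        for _ in range(1000)
    ]
    base, gaps = robustness_gaps(toy)
    print(f"original accuracy: {base:.1f}%")
    for variant, gap in gaps.items():
        print(f"{variant} drop: {gap:.1f} pp")
```

This kind of per-variant accuracy comparison is all that is needed to reproduce the degradation figures once per-problem correctness judgments are available.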