The datasets and benchmarks commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings and misdirections. These range from a restricted scope of mathematical complexity to limited fidelity in capturing aspects beyond the final, written proof (e.g., motivating the proof, or representing the thought processes that lead to a proof). These issues are compounded by a dynamic reminiscent of Goodhart's law: as benchmark performance becomes the primary target of model development, the benchmarks themselves become less reliable indicators of genuine mathematical capability. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or of any forthcoming AI-based mathematical assistants (copilots or ``thought partners''), necessitates a course correction in both the design of mathematical datasets and the criteria used to evaluate models' mathematical ability. In particular, benchmarks must move beyond existing result-based datasets that map theorem statements directly to proofs, and instead focus on datasets that translate the richer facets of mathematical research practice into data that LLMs can learn from. This includes benchmarks that supervise the proving process and the proof-discovery process itself. We also advocate that developers of mathematical datasets consider the concept of ``motivated proof'', introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof-learning signal and thereby alleviate some of the limitations above.