Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and coordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarks (BFCL v3, TauBench, Tau2Bench, and AceBench) and introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out-of-distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state-of-the-art performance with 5-30% improvements over existing baselines at roughly one-tenth the computational cost.