AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, many such agents now exist, ranging from general-purpose "deep research" systems to specialized, science-specific agents such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack the reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack the comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for benchmarking agents more rigorously. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems that span the entire scientific discovery process and multiple scientific domains, many of them inspired by actual user requests to deployed Asta agents. The suite includes the first scientific research environment with production-grade search tools, enabling controlled, reproducible evaluation that better accounts for confounders. Alongside it, we provide a comprehensive suite of nine science-optimized classes of Asta agents, together with numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
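To make points (3) and (4) concrete, the following is a minimal, hypothetical sketch of what a standardized agent-evaluation interface with first-class cost accounting could look like. All names here (`Task`, `Agent`, `EvalResult`, `run_benchmark`, `EchoBaseline`) are illustrative assumptions for exposition and do not correspond to AstaBench's actual API.

```python
# Hypothetical sketch of a standardized agent-evaluation interface of the kind
# the abstract calls for; names are illustrative, not AstaBench's real API.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    prompt: str
    grader: Callable[[str], float]  # maps an agent's answer to a score in [0, 1]


@dataclass
class EvalResult:
    task_id: str
    score: float
    usd_cost: float  # spend is recorded alongside score to expose cost confounders


class Agent(ABC):
    """Uniform entry point: baselines and science-optimized agents alike
    plug into the same harness, so comparisons stay controlled."""

    @abstractmethod
    def solve(self, task: Task) -> tuple[str, float]:
        """Return (answer, usd_cost_incurred_on_this_task)."""


def run_benchmark(agent: Agent, tasks: list[Task]) -> list[EvalResult]:
    """Run one agent over a fixed task list, pairing each score with its cost."""
    results = []
    for task in tasks:
        answer, cost = agent.solve(task)
        results.append(EvalResult(task.task_id, task.grader(answer), cost))
    return results


class EchoBaseline(Agent):
    """Trivial baseline: echoes the prompt at zero cost."""

    def solve(self, task: Task) -> tuple[str, float]:
        return task.prompt, 0.0


if __name__ == "__main__":
    tasks = [Task("t1", "2+2?", lambda a: 1.0 if "4" in a else 0.0)]
    for r in run_benchmark(EchoBaseline(), tasks):
        print(r)
```

Because every agent, from the trivial `EchoBaseline` up, implements the same `solve` contract and reports its own spend, a leaderboard built on such an interface can rank agents on score at a given cost rather than on score alone.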