We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.