This study proposes Tool-RoCo, a novel benchmark built on RoCo, a multi-robot cooperative benchmark, for evaluating large language models (LLMs) in long-term multi-agent cooperation. Recent research on LLM-based multi-agent systems has relied on predefined orchestration while overlooking agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, using tool usage to evaluate multi-agent cooperation and self-organization. Here, tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different levels of autonomy, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates some agents while keeping the others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration and activate additional agents via tool calls. Tool-RoCo includes three multi-robot tasks (SORT, PACK, and CABINET) and uses tool usage to measure format accuracy, parameter accuracy, and agent coordination. Results across several LLMs showed that cooperative tools accounted for only 7.09% of all tools used, indicating that LLM-based agents rarely invoked other agents as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to keep agents active and seldom deactivate them for adaptive coordination. Tool-RoCo provides a systematic benchmark for evaluating LLM autonomy and cooperation in multi-agent tasks. Code and demo: https://github.com/ColaZhang22/Tool-Roco
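As an illustration only, the tool-usage loop (select a tool, receive feedback, adjust in the next round) and the two usage ratios reported in the abstract can be sketched as follows. This is a minimal sketch, not the Tool-RoCo implementation: the names `Tool`, `RetryNextPolicy`, `run_episode`, and `category_share`, the tool names, and the category labels are all hypothetical. A toy rule stands in for the LLM. We also assume that the activation ratio is taken over activation and deactivation tools together, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Tool:
    name: str
    category: str  # hypothetical labels: "action", "cooperative", "activation", "deactivation"


Log = List[Tuple[str, Tool]]  # (agent name, tool it selected)


class RetryNextPolicy:
    """Toy stand-in for an LLM: keep the current tool while it succeeds,
    otherwise move on to the next candidate tool."""

    def __init__(self) -> None:
        self.index = {}

    def __call__(self, agent: str, tools: List[Tool], feedback: Optional[bool]) -> Tool:
        i = self.index.get(agent, 0)
        if feedback is False:  # the previous call failed, so adjust the selection
            i = (i + 1) % len(tools)
        self.index[agent] = i
        return tools[i]


def run_episode(
    agents: List[str],
    tools: List[Tool],
    select: Callable[[str, List[Tool], Optional[bool]], Tool],
    execute: Callable[[str, Tool], bool],
    rounds: int,
) -> Log:
    """Each round, every agent selects a tool, executes it, and stores the feedback."""
    log: Log = []
    feedback = {agent: None for agent in agents}
    for _ in range(rounds):
        for agent in agents:
            tool = select(agent, tools, feedback[agent])
            feedback[agent] = execute(agent, tool)
            log.append((agent, tool))
    return log


def category_share(log: Log, category: str, among: Optional[Iterable[str]] = None) -> float:
    """Fraction of logged tools in `category`, optionally restricted to the categories in `among`."""
    allowed = None if among is None else set(among)
    pool = [tool for _, tool in log if allowed is None or tool.category in allowed]
    if not pool:
        return 0.0
    return sum(tool.category == category for tool in pool) / len(pool)


# Toy environment (assumption): plain action tools always fail, all other tools succeed.
def toy_execute(agent: str, tool: Tool) -> bool:
    return tool.category != "action"


candidate_tools = [
    Tool("pick", "action"),
    Tool("ask_agent", "cooperative"),
    Tool("activate_agent", "activation"),
    Tool("deactivate_agent", "deactivation"),
]

episode_log = run_episode(["alice", "bob"], candidate_tools, RetryNextPolicy(), toy_execute, rounds=4)
cooperative_share = category_share(episode_log, "cooperative")
activation_share = category_share(episode_log, "activation", among=("activation", "deactivation"))
```

In this toy run each agent first tries `pick`, receives negative feedback, and switches to `ask_agent` for the remaining rounds. A real evaluation would replace `RetryNextPolicy` with an LLM call and `toy_execute` with the task environment.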