IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Although large language models (LLMs) have made significant strides across various tasks, they still struggle with complex reasoning and planning. For example, even with carefully designed prompts and explicitly provided prior information, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in sole-planning mode. Similarly, even in thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, but they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. Through simple end-to-end training, this framework not only integrates the structured reasoning and planning capabilities of a well-organized MAS into a single, compact model, but also enables that model to significantly outperform the MAS. Experimental results show that, with Qwen3-8B-Instruct as the base model trained with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B while remaining much smaller.

