Although large language models (LLMs) have made significant strides across various tasks, they still struggle with complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high inference costs due to multi-round internal interactions, long per-response latency, and difficulty with end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. Through a simple end-to-end training pipeline, this framework integrates the structured reasoning and planning capabilities of a well-organized MAS into a single, compact model that not only acquires those capabilities but also significantly outperforms the MAS itself. Experimental results demonstrate that, when Qwen3-8B-Instruct is used as the base model and trained with our method, it achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.