The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that, through the synergistic operation of these components, AgentInfer reduces ineffective token consumption by over 50% and achieves an overall 1.8-2.5x speedup while preserving accuracy. These results underscore that optimizing for agentic task completion, rather than merely per-token throughput, is the key to building scalable, efficient, and self-improving intelligent systems.
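To make the AgentSAM idea concrete before the detailed exposition: a suffix automaton indexes all substrings of previously generated token streams in linear space, so the agent can match the longest suffix of its current context against past sessions and copy what followed as draft tokens for speculative decoding. Below is a minimal sketch under that reading; the class name `SuffixAutomatonDrafter`, the per-state end-position bookkeeping, and the greedy copy-based `draft` heuristic are illustrative assumptions, not the paper's actual implementation.

```python
class SuffixAutomatonDrafter:
    """Indexes past session token streams in a suffix automaton, then proposes
    speculative draft tokens by matching the current context's longest suffix
    seen before and copying the tokens that followed that occurrence."""

    def __init__(self):
        self.next = [{}]    # per-state transitions: token -> state
        self.link = [-1]    # suffix links
        self.length = [0]   # longest substring length each state represents
        self.end = [-1]     # one occurrence end position per state (assumption)
        self.last = 0
        self.corpus = []    # concatenated tokens from past sessions

    def extend(self, token):
        """Standard online suffix-automaton construction, one token at a time."""
        self.corpus.append(token)
        pos = len(self.corpus) - 1
        cur = len(self.next)
        self.next.append({}); self.length.append(self.length[self.last] + 1)
        self.link.append(-1); self.end.append(pos)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:  # split: clone q so represented lengths stay consistent
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q]); self.end.append(self.end[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone; self.link[cur] = clone
        self.last = cur

    def draft(self, context, k=8):
        """Find the longest suffix of `context` occurring in past sessions,
        then return the k tokens that followed that occurrence as drafts."""
        state, matched = 0, 0
        for t in context:
            while state != 0 and t not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if t in self.next[state]:
                state = self.next[state][t]
                matched += 1
            else:
                state, matched = 0, 0
        if matched == 0:
            return []
        start = self.end[state] + 1
        return self.corpus[start:start + k]
```

In a standard speculative-decoding loop, the agent would feed each completed session's tokens through `extend`, call `draft` on the current context, and verify the returned candidates in a single batched forward pass of the target model, keeping the accepted prefix. Because the automaton is built incrementally in amortized constant time per token, the draft model adds negligible overhead compared with a learned draft network.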