Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. We present a serving design that targets these bottlenecks through algorithm-system co-design. First, we replace dense agent interaction graphs with a hierarchical tree topology that induces structured sparsity in inter-agent communication. Second, we introduce a runtime adaptive mechanism that selectively terminates or skips downstream agent invocations using semantic agreement and confidence signals from intermediate outputs. Third, we pipeline agent execution by overlapping incremental prefilling with decoding across dependent agents, improving utilization and reducing inference latency. Across representative tasks, this approach substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within $\pm$1%) relative to dense-connectivity MoA baselines, and can improve accuracy in certain settings.
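The first two mechanisms above can be sketched together: a tree in which each node communicates only with its children (structured sparsity), and an agreement check on child outputs that skips the downstream aggregator call when they already concur. This is a minimal illustrative sketch; all names (`Agent`, `run_tree`, `agreement`) and the majority-vote agreement proxy are assumptions, not the paper's actual interfaces or signals.

```python
# Hypothetical sketch of tree-topology MoA with agreement-based early exit.
# Names and the majority-vote agreement proxy are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    """A node in the hierarchical agent tree: leaves call a model,
    internal nodes aggregate their children's answers."""
    answer: Callable[[str], str]              # proposer/aggregator call (stubbed)
    children: List["Agent"] = field(default_factory=list)

def agreement(outputs: List[str]) -> float:
    """Fraction of child outputs matching the majority answer — a cheap
    semantic-agreement proxy standing in for the paper's signal."""
    if not outputs:
        return 1.0
    majority = max(set(outputs), key=outputs.count)
    return outputs.count(majority) / len(outputs)

def run_tree(agent: Agent, query: str, threshold: float = 0.8) -> str:
    """Evaluate children first (each node talks only to its children),
    then skip the aggregator invocation when the children already agree —
    the runtime adaptive early exit."""
    if not agent.children:
        return agent.answer(query)
    child_outputs = [run_tree(c, query, threshold) for c in agent.children]
    if agreement(child_outputs) >= threshold:
        # Strong agreement: return the majority answer and skip the
        # downstream aggregator call entirely.
        return max(set(child_outputs), key=child_outputs.count)
    # Otherwise invoke the aggregator on the concatenated child outputs.
    return agent.answer(query + "\n" + "\n".join(child_outputs))
```

In this sketch the skipped aggregator call is where latency is saved: when leaves agree, no further model invocation occurs on that path.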