The scaling of large language models is increasingly constrained by the limited memory capacity of modern GPUs. To mitigate this, Mixture-of-Experts (MoE) architectures activate only a small subset of parameters during inference, significantly lowering both memory demand and computational overhead. However, conventional MoE inference approaches, which select active experts independently at each layer, often incur considerable latency from frequent parameter transfers between host and GPU memory. Moreover, existing cross-layer prediction strategies typically rely on fixed lookahead steps and therefore do not adapt to different hardware platforms and workloads, which limits their robustness and effectiveness. To address these challenges, we present ExpertFlow, a runtime system for MoE inference that combines adaptive expert prefetching with cache-aware routing. ExpertFlow continuously adjusts its prediction horizon for expert activation using runtime statistics such as transfer bandwidth, expert parameter size, and model feedback signals. It further incorporates a hybrid cross-layer prediction scheme that fuses pre-gating information with intermediate computational states to anticipate future expert demand. By adaptively refining prefetching decisions and aligning them with observed usage, ExpertFlow reduces cache misses and hides the latency of expert swap-ins. Our evaluation shows that ExpertFlow reduces model stall time to less than 0.1% of the baseline, demonstrating its effectiveness for MoE inference under stringent memory constraints.
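To make the adaptive prediction-horizon idea concrete, the following is a minimal sketch, assuming the runtime can measure host-to-GPU transfer bandwidth and per-layer compute time: the lookahead is chosen so that the transfer time of one layer's active experts is hidden behind the compute of the layers executed while the transfer is in flight. The class name `AdaptiveHorizon`, the EMA smoothing, the clamping bounds, and all numbers are illustrative assumptions, not ExpertFlow's actual implementation.

```python
import math

class AdaptiveHorizon:
    """Hypothetical adaptive lookahead selection for MoE expert prefetching."""

    def __init__(self, expert_bytes: int, ema: float = 0.9,
                 min_horizon: int = 1, max_horizon: int = 8):
        self.expert_bytes = expert_bytes  # size of one expert's parameters (bytes)
        self.ema = ema                    # smoothing factor for noisy measurements
        self.min_horizon = min_horizon
        self.max_horizon = max_horizon
        self.bandwidth = None             # bytes/s, host -> GPU (measured at runtime)
        self.layer_time = None            # seconds per MoE layer (measured at runtime)

    def observe(self, measured_bandwidth: float, measured_layer_time: float) -> None:
        """Fold fresh runtime measurements into smoothed estimates."""
        if self.bandwidth is None:
            self.bandwidth = measured_bandwidth
            self.layer_time = measured_layer_time
        else:
            self.bandwidth = self.ema * self.bandwidth + (1 - self.ema) * measured_bandwidth
            self.layer_time = self.ema * self.layer_time + (1 - self.ema) * measured_layer_time

    def horizon(self, experts_per_layer: int) -> int:
        """Layers of lookahead needed so expert swap-in overlaps with compute."""
        if self.bandwidth is None:
            return self.max_horizon  # no statistics yet: prefetch aggressively
        transfer_time = experts_per_layer * self.expert_bytes / self.bandwidth
        layers_ahead = math.ceil(transfer_time / max(self.layer_time, 1e-9))
        return max(self.min_horizon, min(layers_ahead, self.max_horizon))

# Usage with hypothetical numbers: a ~235 MB fp16 expert, ~24 GB/s effective
# PCIe bandwidth, and 2 ms per MoE layer yield a lookahead clamped to 8 layers.
h = AdaptiveHorizon(expert_bytes=235_000_000)
h.observe(measured_bandwidth=24e9, measured_layer_time=2e-3)
print(h.horizon(experts_per_layer=2))  # -> 8
```

Under these assumed measurements, a fixed-step predictor tuned for a faster interconnect would under-prefetch here and stall, which is the adaptivity gap the abstract describes.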