Private large language model (LLM) inference based on cryptographic primitives offers a promising path towards privacy-preserving deep learning. However, existing frameworks only support dense LLMs such as LLaMA-1 and struggle to scale to mixture-of-experts (MoE) architectures. The key challenge lies in securely evaluating the dynamic routing mechanism in MoE layers, which may reveal sensitive input information if not fully protected. In this paper, we propose CryptoMoE, the first framework that enables private, efficient, and accurate inference for MoE-based models. CryptoMoE balances expert loads to protect expert routing information and introduces novel protocols for secure expert dispatch and combine. CryptoMoE also develops a confidence-aware token selection strategy and a batch matrix multiplication protocol to further improve accuracy and efficiency. Extensive experiments on DeepSeekMoE-16.4B, OLMoE-6.9B, and QWenMoE-14.3B show that CryptoMoE achieves a $2.8\sim3.5\times$ end-to-end latency reduction and a $2.9\sim4.3\times$ communication reduction over a dense baseline with minimal accuracy loss. We also adapt CipherPrune (ICLR'25) for MoE inference and demonstrate that CryptoMoE reduces communication by up to $4.3\times$. Code is available at: https://github.com/PKU-SEC-Lab/CryptoMoE.
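To make the load-balancing intuition concrete, below is a minimal plaintext sketch of capacity-fixed expert dispatch: every expert receives the same fixed number of token slots regardless of the input, so per-expert token counts carry no routing information. The function name `balanced_dispatch`, the drop/pad policy, and the dummy-slot encoding are illustrative assumptions for exposition only, not the paper's secure protocol.

```python
import numpy as np

def balanced_dispatch(router_logits, top_k, capacity):
    """Plaintext sketch (assumed): route each token to its top-k experts, then
    pad or truncate every expert's slot list to a fixed `capacity`, so the
    number of tokens each expert processes is data-independent."""
    n_tokens, n_experts = router_logits.shape
    # Top-k expert indices per token (highest router scores first).
    topk = np.argsort(-router_logits, axis=1)[:, :top_k]
    slots = {e: [] for e in range(n_experts)}
    for t in range(n_tokens):
        for e in topk[t]:
            slots[e].append(t)
    # -1 marks a padded dummy slot; shape is fixed at (n_experts, capacity).
    dispatch = np.full((n_experts, capacity), -1, dtype=int)
    for e in range(n_experts):
        kept = slots[e][:capacity]          # drop overflow beyond capacity
        dispatch[e, :len(kept)] = kept      # pad underfull experts with dummies
    return dispatch

# Example: 8 tokens, 4 experts, top-2 routing, 4 slots per expert.
logits = np.random.randn(8, 4)
print(balanced_dispatch(logits, top_k=2, capacity=4))
```

In a secure setting the same fixed-shape dispatch would be evaluated under cryptographic protection, whereas this sketch only illustrates why a fixed per-expert capacity hides the routing distribution.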