Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative settings faces fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these challenges, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in federated learning (FL) across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn's potential to deliver scalability and efficiency in real-world edge deployments.
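To make the protocol concrete, the sketch below illustrates the core idea of local self-attention with periodic KV aggregation. It is a minimal, illustrative reading of the abstract, not the paper's exact algorithm: the single-head attention, the residual update, the concatenation-based KV pooling, and the `sync_interval` parameter are all assumptions introduced here for clarity.

```python
# Minimal FedAttn-style forward pass (illustrative sketch, not the paper's implementation).
# Assumptions: single-head attention, residual updates, and concatenation as the
# KV aggregation rule; `sync_interval` controls how often KV matrices are exchanged.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fedattn_forward(local_tokens, weights, sync_interval=2):
    """local_tokens: list of per-participant token matrices (never exchanged).
    weights: per-block (Wq, Wk, Wv) tuples shared by all participants."""
    states = [x.copy() for x in local_tokens]
    for b, (Wq, Wk, Wv) in enumerate(weights):
        # Each participant computes Q/K/V locally from its own token representations.
        qkv = [(x @ Wq, x @ Wk, x @ Wv) for x in states]
        sync = (b + 1) % sync_interval == 0
        if sync:
            # "Global aggregation": pool K and V across participants.
            # Raw tokens and private prompts never leave a participant.
            k_all = np.concatenate([k for _, k, _ in qkv])
            v_all = np.concatenate([v for _, _, v in qkv])
        for i, (q, k, v) in enumerate(qkv):
            # Between sync rounds, attend only over local KV; at sync rounds,
            # attend over the aggregated KV from all participants.
            k_use, v_use = (k_all, v_all) if sync else (k, v)
            attn = softmax(q @ k_use.T / np.sqrt(q.shape[-1]))
            states[i] = states[i] + attn @ v_use  # residual refinement of local states
    return states


# Toy usage: 3 participants with private prompts of different lengths, 4 blocks.
rng = np.random.default_rng(0)
d, num_blocks = 8, 4
weights = [tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3))
           for _ in range(num_blocks)]
prompts = [rng.standard_normal((n, d)) for n in (5, 3, 7)]
outputs = fedattn_forward(prompts, weights, sync_interval=2)
```

In this reading, a larger `sync_interval` reduces communication (fewer KV exchanges) at the cost of each participant refining its representations with stale or missing cross-participant context, which is one way to picture the quality-efficiency trade-off the abstract describes.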