Federated learning is an emerging learning paradigm in which multiple clients collaboratively train a machine learning model in a privacy-preserving manner. Personalized federated learning extends this paradigm by learning personalized models to overcome heterogeneity across clients. Recently, there have been initial attempts to apply Transformers to federated learning. However, the impact of federated learning algorithms on self-attention has not yet been studied. This paper investigates this relationship and reveals that federated averaging actually has a negative impact on self-attention in the presence of data heterogeneity, which limits the capabilities of the Transformer model in federated learning settings. Based on this, we propose FedTP, a novel Transformer-based federated learning framework that learns personalized self-attention for each client while aggregating the remaining parameters across clients. Instead of a vanilla personalization mechanism that keeps each client's self-attention layers local, we develop a learn-to-personalize mechanism that further encourages cooperation among clients and improves the scalability and generalization of FedTP. Specifically, learn-to-personalize is realized by training a hypernetwork on the server that outputs the personalized projection matrices of the self-attention layers, which generate client-wise queries, keys, and values. Furthermore, we present a generalization bound for FedTP with the learn-to-personalize mechanism. Notably, FedTP offers a convenient environment for performing a range of image and language tasks using the same federated network architecture, all of which benefit from Transformer personalization. Extensive experiments verify that FedTP with the learn-to-personalize mechanism achieves state-of-the-art performance in non-IID scenarios. Our code is available online.
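To make the learn-to-personalize mechanism concrete, the sketch below shows how a server-side hypernetwork could map a learnable client embedding to the query/key/value projection matrices of each self-attention layer, so that attention is client-specific while the remaining Transformer parameters are shared and aggregated as usual. This is a minimal illustration in PyTorch; all class names, dimensions, and the MLP structure are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AttentionHypernetwork(nn.Module):
    """Illustrative hypernetwork: client embedding -> per-client Q/K/V projections."""

    def __init__(self, num_clients: int, embed_dim: int, d_model: int,
                 num_layers: int, hidden_dim: int = 128):
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers
        # One learnable embedding per client, maintained on the server.
        self.client_embeddings = nn.Embedding(num_clients, embed_dim)
        # MLP head that emits all Q/K/V projection weights for every
        # self-attention layer of the client's Transformer.
        out_dim = num_layers * 3 * d_model * d_model
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, client_id: torch.Tensor):
        """Returns a list of (W_q, W_k, W_v) tuples, one per attention layer."""
        z = self.client_embeddings(client_id)                 # (embed_dim,)
        flat = self.mlp(z)                                    # (num_layers * 3 * d^2,)
        w = flat.view(self.num_layers, 3, self.d_model, self.d_model)
        return [(w[l, 0], w[l, 1], w[l, 2]) for l in range(self.num_layers)]


# Hypothetical usage: before each round the server generates client-specific
# attention projections and sends them, together with the shared parameters,
# to the selected client; after local training, gradients with respect to the
# generated projections flow back into the hypernetwork, while the remaining
# parameters are averaged across clients as in standard federated averaging.
hypernet = AttentionHypernetwork(num_clients=100, embed_dim=32, d_model=64, num_layers=4)
qkv_weights = hypernet(torch.tensor(7))  # projection matrices for client 7
```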