Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis

In fields ranging from finance to omics, data are increasingly distributed across multiple individual sources (referred to as "clients" in some studies). Integrating raw data, although powerful, is often not feasible, for example because of privacy protection considerations. Distributed learning techniques have been developed to integrate summary statistics rather than raw data. Many existing distributed learning studies make the stringent assumption that all clients share the same model. To accommodate data heterogeneity, some federated learning methods allow for client-specific models. In this article, we consider the scenario in which clients form clusters: clients in the same cluster share the same model, and different clusters have different models. Accounting for this clustering structure can lead to a better understanding of the "interconnections" among clients and reduce the number of parameters. To this end, we develop a novel penalization approach. Specifically, group penalization is imposed for regularized estimation and selection of important variables, and fusion penalization is imposed to automatically cluster clients. An effective ADMM (alternating direction method of multipliers) algorithm is developed, and the estimation, selection, and clustering consistency properties are established under mild conditions. Simulation and data analysis further demonstrate the practical utility and superiority of the proposed approach.
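
To make the roles of the two penalties concrete, the following is a minimal sketch of a doubly penalized objective of the kind described above. It is an illustration under stated assumptions, not the article's formulation: the squared-error loss, the plain L2 group and pairwise fusion penalties, and the names `penalized_objective`, `clusters_from_coefficients`, `lam_group`, and `lam_fuse` are all hypothetical, since the abstract does not specify the loss, the exact penalty functions, or the ADMM updates. The sketch only evaluates the objective and reads clusters off a given coefficient matrix; it does not implement the optimization.

```python
import numpy as np


def penalized_objective(X_list, y_list, B, lam_group, lam_fuse):
    """Evaluate an illustrative doubly penalized objective (assumed form).

    B has shape (M, p): row m holds the coefficient vector of client m.
    """
    M, p = B.shape
    # Per-client squared-error loss (an assumption; the abstract names no loss).
    loss = sum(
        0.5 * np.mean((y - X @ b) ** 2)
        for X, y, b in zip(X_list, y_list, B)
    )
    # Group penalty: one group per variable, pooling its coefficients across
    # clients, so a variable is kept or dropped for all clients together.
    group_pen = lam_group * np.sum(np.linalg.norm(B, axis=0))
    # Fusion penalty: distance between every pair of client coefficient
    # vectors; pairs shrunk to zero distance end up in the same cluster.
    fuse_pen = lam_fuse * sum(
        np.linalg.norm(B[i] - B[j])
        for i in range(M)
        for j in range(i + 1, M)
    )
    return loss + group_pen + fuse_pen


def clusters_from_coefficients(B, tol=1e-6):
    """Label clients whose coefficient vectors coincide (within tol) alike."""
    labels = [-1] * len(B)
    next_label = 0
    for i in range(len(B)):
        if labels[i] != -1:
            continue  # already assigned to an earlier cluster
        labels[i] = next_label
        for j in range(i + 1, len(B)):
            if labels[j] == -1 and np.linalg.norm(B[i] - B[j]) <= tol:
                labels[j] = next_label
        next_label += 1
    return labels


# Three hypothetical clients: the first two share a model, the third differs.
B = np.array([[1.0, 0.0, 2.0],
              [1.0, 0.0, 2.0],
              [-1.0, 0.0, 0.5]])
print(clusters_from_coefficients(B))  # [0, 0, 1]
```

In this toy coefficient matrix the second variable is zero for every client, so it contributes nothing to the group penalty, and the first two clients have identical rows, so they contribute nothing to the fusion penalty. This illustrates how the two penalties can reward variable selection and client clustering, respectively.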
