Heterogeneity-aware and communication-efficient distributed statistical inference

In multicenter research, individual-level data are often protected against sharing across sites. To overcome this barrier, many distributed algorithms that require sharing only aggregated information have been developed. Existing distributed algorithms usually assume that the data are homogeneously distributed across sites. This assumption ignores the fact that data collected at different sites may come from different sub-populations and environments, which can make the data distribution heterogeneous across sites. Ignoring this heterogeneity may lead to erroneous statistical inference. In this paper, we propose distributed algorithms that account for heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms have the same communication cost as existing communication-efficient algorithms. We establish a non-asymptotic risk bound for the proposed distributed estimator and derive its limiting distribution in a two-index asymptotic setting, in which both the sample size per site and the number of sites are allowed to go to infinity. In addition, we show that the asymptotic variance of the estimator attains the Cramér-Rao lower bound when the number of sites grows at a slower rate than the sample size at each site. Finally, we use simulation studies and a real data application to demonstrate the validity and feasibility of the proposed methods.
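For orientation, the homogeneous surrogate likelihood idea that the abstract says is being extended can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed heterogeneity-aware estimator: it has no site-specific nuisance parameters, no efficient score, and no density ratio tilting. It assumes a squared-error loss so that the surrogate minimizer has a closed form, and the names `avg_grad` and `surrogate_estimate` are hypothetical. The lead site minimizes its local loss minus a linear correction built from gradients, so each other site shares only one gradient vector.

```python
import numpy as np

def avg_grad(beta, X, y):
    """Gradient of the average squared-error loss (1 / 2n) * ||y - X beta||^2."""
    return X.T @ (X @ beta - y) / len(y)

def surrogate_estimate(sites, beta_init):
    """One round of a surrogate-loss update in the homogeneous setting.

    sites: list of (X, y) pairs; sites[0] is the lead site, the only one
    whose individual-level data are used directly.
    beta_init: initial estimate at which every site evaluates its gradient.
    """
    X1, y1 = sites[0]
    n1 = len(y1)
    n_total = sum(len(y) for _, y in sites)

    # Aggregated information: sample-size-weighted average of site gradients.
    global_grad = sum(len(y) * avg_grad(beta_init, X, y) for X, y in sites) / n_total

    # Linear correction: local gradient minus global gradient at beta_init.
    shift = avg_grad(beta_init, X1, y1) - global_grad

    # Minimize L_1(beta) - <shift, beta>. Setting the gradient to zero gives
    # (X1'X1 / n1) beta = X1'y1 / n1 + shift for squared-error loss.
    return np.linalg.solve(X1.T @ X1 / n1, X1.T @ y1 / n1 + shift)
```

With a single site the correction vanishes and the update reduces to the local least-squares fit. With several sites, the pooled least-squares estimate is a fixed point of the update, which is one way to see why the gradient correction pulls the lead site's estimate toward the pooled solution without transferring individual-level data.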
