Integrating data from multiple sources is increasingly used to achieve larger sample sizes and enhance population diversity. Our previous work established that, when the sources are random samples from the same underlying population, integrating large incomplete datasets with summary-level data produces unbiased parameter estimates. In this study, we develop a novel statistical framework that integrates summary-level data with information from heterogeneous data sources by leveraging auxiliary information. The proposed approach uses this auxiliary information to estimate study-specific sampling weights and then calibrates the estimating equations to obtain the full set of model parameters. We evaluate the performance of the proposed method through simulation studies under various sampling designs. We illustrate its application by reanalyzing U.S. cancer registry data combined with summary-level odds ratio estimates for selected colorectal cancer (CRC) risk factors, in an analysis that relaxes the random sampling assumption.
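The idea of estimating sampling weights from auxiliary information and then solving weighted estimating equations can be illustrated with a minimal sketch. This is not the framework proposed in the paper: it swaps in a generic post-stratification weight and an inverse-probability-weighted logistic score, and it omits the summary-level data entirely. It assumes that selection depends only on a stratum indicator (here the outcome itself) and that the population share of each stratum is known. The function names `estimate_weights`, `solve_weighted_logistic`, and `run_demo`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np


def estimate_weights(strata_sample, population_shares):
    """Post-stratification weights: population share / sample share per stratum.

    Assumption: `population_shares` (the auxiliary information) is known and
    every stratum listed in it appears in the sample.
    """
    strata_sample = np.asarray(strata_sample)
    weights = np.empty(len(strata_sample), dtype=float)
    for stratum, pop_share in population_shares.items():
        mask = strata_sample == stratum
        weights[mask] = pop_share / mask.mean()
    return weights


def solve_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson on the weighted score
    sum_i w_i * x_i * (y_i - expit(x_i' beta)) = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - p))
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def run_demo(seed=0):
    """Simulate outcome-dependent sampling and compare weighted vs. unweighted fits."""
    rng = np.random.default_rng(seed)
    n_pop = 200_000
    true_beta = np.array([-2.0, 0.5])  # hypothetical intercept and slope

    # Population: one covariate, binary outcome from a logistic model.
    x = rng.normal(size=n_pop)
    prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
    y = rng.binomial(1, prob)

    # Non-random sampling: cases are far more likely to be selected.
    selected = rng.random(n_pop) < np.where(y == 1, 0.9, 0.1)
    xs, ys = x[selected], y[selected]
    X = np.column_stack([np.ones(len(xs)), xs])

    # Auxiliary information: outcome prevalence in the population.
    prevalence = y.mean()
    w = estimate_weights(ys, {0: 1.0 - prevalence, 1: prevalence})

    beta_weighted = solve_weighted_logistic(X, ys, w)
    beta_unweighted = solve_weighted_logistic(X, ys, np.ones(len(ys)))
    return true_beta, beta_weighted, beta_unweighted, w


true_beta, beta_weighted, beta_unweighted, w = run_demo()
print("true      :", true_beta)
print("weighted  :", np.round(beta_weighted, 3))
print("unweighted:", np.round(beta_unweighted, 3))
```

In this toy setting the unweighted fit recovers the slope but not the intercept, because selection depends on the outcome, while the weighted score equations target the population parameters. A realistic application would estimate the weights from richer auxiliary covariates and would need variance estimates that account for the estimated weights.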