Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variation, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased because the integration step is itself data-dependent. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by utilizing control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects using negative control outcomes. By utilizing surrogate control outcomes as an extension of negative control outcomes, we develop semiparametric inference on projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecification and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. We evaluate our proposal using random forests, through simulations and an analysis of single-cell CRISPR-perturbed datasets that may contain unmeasured confounders.