Predictions from pre-trained machine learning (ML) models are increasingly used to fill in incomplete data for downstream scientific inquiry, but integrating them naively risks biased inference. Several recent methods provide valid inference with ML imputations regardless of prediction quality while improving efficiency over complete-case analyses. However, existing approaches are often limited to missing outcomes under a missing-completely-at-random (MCAR) assumption, and they fail to handle general missingness patterns (missingness in both the outcome and the exposures) under the more realistic missing-at-random (MAR) assumption. This paper develops a novel method for valid statistical inference in general Z-estimation problems using ML imputations under the MAR assumption and general missingness patterns. The core technical idea is to stratify observations by their distinct missingness patterns and to construct an estimator that appropriately weights and aggregates pattern-specific information through a masking-and-imputation procedure on the complete cases. We provide theoretical guarantees that the proposed estimator is asymptotically normal and that it dominates weighted complete-case analyses in efficiency. In practice, the method is simple to implement because it can leverage existing software for weighted complete-case analysis. Extensive simulations validate the theoretical results, and a real data example further illustrates the practical utility of the proposed method. The paper concludes with a brief discussion of practical implications, limitations, and potential future directions.
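As a rough illustration of the stratify-and-mask idea described in the abstract, the following Python sketch groups rows by missingness pattern and then hides the same columns on copies of the complete cases. It is only a sketch under stated assumptions: the function names `stratify_by_pattern` and `mask_complete_cases`, the NaN encoding of missing values, and the toy data are all hypothetical, and the ML imputation, weighting, and aggregation steps are not specified in the abstract and are deliberately left out.

```python
import numpy as np


def stratify_by_pattern(data):
    """Group row indices by missingness pattern.

    A pattern is a tuple of booleans, one per column (True = missing).
    Assumes missing values are encoded as NaN.
    """
    missing = np.isnan(data)
    groups = {}
    for i, row in enumerate(missing):
        groups.setdefault(tuple(bool(m) for m in row), []).append(i)
    return groups


def mask_complete_cases(data, groups):
    """For each incomplete pattern, copy the complete cases and hide
    the columns that are missing under that pattern."""
    complete_key = (False,) * data.shape[1]
    complete_idx = np.asarray(groups.get(complete_key, []), dtype=int)
    complete = data[complete_idx]
    masked = {}
    for pattern in groups:
        if pattern == complete_key:
            continue
        copy = complete.copy()
        copy[:, np.array(pattern, dtype=bool)] = np.nan
        masked[pattern] = copy
    return masked


# Toy data (made up): columns could stand for exposure, covariate, outcome.
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, np.nan, 1.5],
    [2.5, np.nan, 0.5],
])

groups = stratify_by_pattern(data)
masked = mask_complete_cases(data, groups)

for pattern, rows in groups.items():
    print(pattern, rows)
```

In a fuller pipeline, one would presumably impute both the masked complete cases and the genuinely incomplete rows of each pattern with the same pre-trained model, so that the two sets of imputed rows can be compared pattern by pattern; how the paper weights and combines those comparisons is not stated in the abstract.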