In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and are therefore missing for a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are the go-to analytical methods for handling missing data and dominate the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), both of which are currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study across a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high proportion of missingness. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare antidepressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.