Multiple imputation (MI) is one of the most popular approaches to handling missing values in questionnaires and surveys. MI by multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, the imputer must specify, for each variable under imputation, which variables act as predictors in the imputation model. Selecting these predictors is a difficult but fundamental step in the MI procedure, especially when a data set contains many variables. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the "many variables" problem that arises when imputing large social science data sets. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy by means of a Monte Carlo simulation study and a case study. We find that using PCR on a variable-by-variable basis performs best and that it can perform comparably to expertly designed imputation procedures.
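To make the variable-by-variable idea concrete, the following is a minimal sketch of PCR used as the univariate step inside a chained-equations loop. It is an illustration under stated assumptions, not the implementation evaluated in this project: the function names `pcr_impute_column` and `pcr_mice`, the number of components, the mean-based starting values, and the toy data are all hypothetical, and the sketch only adds residual noise to the predictions rather than also drawing the regression parameters, as a proper MI procedure would.

```python
import numpy as np


def pcr_impute_column(filled, missing, j, n_components, rng):
    """Re-impute column j from principal components of all other columns.

    Hypothetical helper: `filled` is the current completed data matrix and
    `missing` is the boolean mask of originally missing cells.
    """
    miss_j = missing[:, j]
    obs_j = ~miss_j
    if not miss_j.any():
        return  # nothing to impute in this column

    # Predictors are all other columns, standardized before the PCA.
    predictors = np.delete(filled, j, axis=1)
    mu = predictors.mean(axis=0)
    sd = predictors.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    z = (predictors - mu) / sd

    # Principal component scores via the singular value decomposition.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    k = min(n_components, vt.shape[0])
    scores = z @ vt[:k].T

    # Regress the observed part of column j on the first k components.
    design = np.column_stack([np.ones(len(scores)), scores])
    y_obs = filled[obs_j, j]
    beta, *_ = np.linalg.lstsq(design[obs_j], y_obs, rcond=None)

    # Simplification: add residual noise only, no parameter draw.
    resid = y_obs - design[obs_j] @ beta
    dof = max(int(obs_j.sum()) - design.shape[1], 1)
    sigma = np.sqrt(resid @ resid / dof)
    noise = rng.normal(0.0, sigma, int(miss_j.sum()))
    filled[miss_j, j] = design[miss_j] @ beta + noise


def pcr_mice(data, n_components=2, n_iter=10, seed=0):
    """Return one completed data set from a PCR-based chained-equations run."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Assumed starting values: column means of the observed entries.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        for j in range(data.shape[1]):
            pcr_impute_column(filled, missing, j, n_components, rng)
    return filled


# Toy data (assumption): six variables driven by two latent factors,
# with roughly 15% of the cells deleted completely at random.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X_full = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.15] = np.nan

X_imp = pcr_mice(X_miss, n_components=2)
```

Because the components are recomputed for each target variable from all remaining columns, no predictor has to be selected by hand. Calling `pcr_mice` with several different `seed` values would give several completed data sets, which is the form that MI analyses pool over.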