Missing values in data with mixed variable types are a common problem in many machine learning applications, such as survey processing and a range of medical applications. Recently, Gaussian copula models have been proposed as a way to impute missing values within a probabilistic framework. While existing Gaussian copula models have been shown to yield state-of-the-art performance, they have two limitations: they rely on an approximation that is fast but may be imprecise, and they do not support unordered multinomial variables. We address the first limitation by using direct and arbitrarily precise approximations, based on randomized quasi-Monte Carlo procedures, for both model estimation and imputation. The resulting method yields lower errors for both the estimated model parameters and the imputed values than previously proposed methods. We also extend the previous Gaussian copula models to support unordered multinomial variables, in addition to the ordinal, binary, and continuous variables that are already supported.
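To make the idea of an "arbitrarily precise" randomized quasi-Monte Carlo approximation concrete, the following is a minimal sketch of one standard way to estimate a multivariate normal rectangle probability, the kind of quantity that arises in Gaussian copula likelihoods with ordinal or binary variables. It is not the authors' implementation: the function name `mvn_prob_rqmc` and its parameters `m`, `n_rep`, and `seed` are hypothetical, and the estimator shown is a Genz-style sequential conditioning scheme driven by scrambled Sobol points, which is an assumption about what such a procedure could look like.

```python
import numpy as np
from scipy.stats import norm, qmc


def mvn_prob_rqmc(lower, upper, cov, m=10, n_rep=8, seed=0):
    """Estimate P(lower < Z < upper) for Z ~ N(0, cov).

    Hypothetical helper for illustration only. Uses 2**m scrambled Sobol
    points per replicate and n_rep independent scramblings, so the spread
    across replicates gives a standard error for the estimate.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    chol = np.linalg.cholesky(np.asarray(cov, dtype=float))
    dim = lower.size
    rng = np.random.default_rng(seed)

    estimates = []
    for _ in range(n_rep):
        # Only dim - 1 uniforms are needed; the last coordinate is integrated exactly.
        sobol = qmc.Sobol(d=max(dim - 1, 1), scramble=True, seed=rng)
        u = sobol.random_base2(m)
        y = np.zeros((u.shape[0], dim))
        weight = np.ones(u.shape[0])
        for i in range(dim):
            # Conditional bounds for coordinate i given the earlier draws.
            shift = y[:, :i] @ chol[i, :i]
            lo = norm.cdf((lower[i] - shift) / chol[i, i])
            hi = norm.cdf((upper[i] - shift) / chol[i, i])
            weight *= hi - lo
            if i < dim - 1:
                # Draw coordinate i from the truncated standard normal.
                p = np.clip(lo + u[:, i] * (hi - lo), 1e-15, 1 - 1e-15)
                y[:, i] = norm.ppf(p)
        estimates.append(weight.mean())

    estimates = np.array(estimates)
    return estimates.mean(), estimates.std(ddof=1) / np.sqrt(n_rep)


# Orthant probability of a bivariate normal with correlation 0.5.
est, se = mvn_prob_rqmc([-np.inf, -np.inf], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
print(est, se)
```

For this example the exact value is known in closed form, 1/4 + arcsin(0.5)/(2π) = 1/3, so the estimate can be checked directly. Increasing `m` or `n_rep` tightens the approximation at a higher computational cost, which is the sense in which such estimators can be made as precise as needed.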