Representation Learning with Autoencoders for Electronic Health Records: A Comparative Study

Withdrawn from arXiv. Reason: This submission is an extension of our other research, which has already been submitted to arXiv (arXiv:1801.02961); we therefore decided to update that version and withdraw this submission.

The increasing volume of Electronic Health Records (EHRs) in recent years provides great opportunities for data scientists to collaborate on different aspects of healthcare research by applying advanced analytics to these clinical data. A key requirement, however, is obtaining meaningful insights from high-dimensional, sparse, and complex clinical data. Data science approaches typically address this challenge by performing feature learning to build more reliable and informative feature representations from clinical data, followed by supervised learning. In this paper, we propose a predictive modeling approach based on deep-learning feature representations and word embedding techniques. Our method uses different deep architectures (stacked sparse autoencoders, deep belief networks, adversarial autoencoders, and variational autoencoders) to learn feature representations at a higher level of abstraction, obtaining effective and robust features from EHRs, and then builds prediction models on top of them. Our approach is particularly useful when unlabeled data is abundant whereas labeled data is scarce. We investigate the performance of representation learning through a supervised learning approach. Our focus is to present a comparative study that evaluates the performance of different deep architectures through supervised learning and provides insights into the choice of deep feature representation techniques. Our experiments demonstrate that, for small data sets, the stacked sparse autoencoder shows superior generalization performance in prediction due to sparsity regularization, whereas variational autoencoders outperform the competing approaches for large data sets due to their capability of learning the representation distribution.
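
The abstract attributes the two main findings to sparsity regularization (stacked sparse autoencoders) and to learning a distribution over representations (variational autoencoders). As a minimal sketch of what those two regularization terms commonly look like in the literature, and not necessarily the exact formulation used in this paper, the Python below computes a KL-based sparsity penalty and the Gaussian KL term of a variational autoencoder. The function names and the default target activation `rho=0.05` are hypothetical choices made for illustration.

```python
import math


def sparsity_penalty(hidden_activations, rho=0.05):
    """KL sparsity penalty commonly used by sparse autoencoders.

    hidden_activations: list of rows (one per sample), each a list of
    sigmoid activations strictly between 0 and 1.
    rho: target mean activation (hypothetical default, not from the paper).
    """
    n_samples = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    penalty = 0.0
    for j in range(n_hidden):
        # Mean activation of hidden unit j over the batch.
        rho_hat = sum(row[j] for row in hidden_activations) / n_samples
        # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample.

    mu and log_var are the encoder outputs of a variational autoencoder:
    the mean and log-variance of the approximate posterior over the code.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

In a typical training loop, either term would be added to the reconstruction loss with a weighting coefficient. The sparsity penalty is zero when every hidden unit's mean activation equals `rho` and grows as units become more active, while the Gaussian KL term is zero only when the encoder outputs a standard normal posterior.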
