Missing values are unavoidable when working with data, and they become more frequent as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations, leading to the development of various methods, formalizations, and tools. Nevertheless, it remains challenging for practitioners to decide which method is best suited to their problem, partly because the topic is not covered systematically in statistics or data science curricula. To help address this challenge, we have launched the "R-miss-tastic" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), "R-miss-tastic" covers the development of standardized analysis workflows: we have developed several pipelines in R and Python that provide hands-on illustrations of, and recommendations on, missing values handling in various statistical tasks such as matrix completion, estimation, and prediction, while ensuring reproducibility of the analyses. Finally, the platform is intended for users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and teachers who are looking for didactic materials (notebooks, videos, slides).