Missing Value Imputation with R

Julie Josse, INRIA - Agrocampus Ouest

Post-tutorial notes

The materials used in the tutorial are available here.

Tutorial Description

The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present important methodological and technical challenges for data analysts.

The aim of this tutorial is to present an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries.

We will cover single imputation, with a focus on matrix completion methods based on iterative regularized SVD; confidence intervals, through the fundamentals of multiple imputation strategies; and visualization with incomplete data. The approaches will be illustrated on real data with continuous, binary and categorical variables using some of the main R packages: Amelia, mice, missForest, missMDA, norm, softImpute, VIM.
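As a rough illustration of how multiple imputation leads to confidence intervals, the sketch below pools estimates from several imputed data sets using Rubin's rules. It is a minimal base R sketch, not code from the tutorial materials: the function name `pool_rubin` and its arguments are hypothetical, and the numbers in the usage line are made up for illustration.

```r
# Pool M point estimates and their variances with Rubin's rules.
# `estimates` and `variances` are hypothetical inputs: one value per imputed data set.
pool_rubin <- function(estimates, variances, level = 0.95) {
  M <- length(estimates)
  stopifnot(M >= 2, length(variances) == M)
  q_bar <- mean(estimates)        # pooled point estimate
  W <- mean(variances)            # within-imputation variance
  B <- var(estimates)             # between-imputation variance
  total <- W + (1 + 1 / M) * B    # total variance
  r <- (1 + 1 / M) * B / W        # relative increase in variance due to missing data
  df <- (M - 1) * (1 + 1 / r)^2   # degrees of freedom of the reference t distribution
  half <- qt(1 - (1 - level) / 2, df) * sqrt(total)
  list(estimate = q_bar, variance = total, df = df,
       ci = c(q_bar - half, q_bar + half))
}

# Illustrative numbers only, not results from the tutorial.
pooled <- pool_rubin(estimates = c(1.0, 1.2, 0.8),
                     variances = c(0.04, 0.05, 0.03))
```

The between-imputation term is what single imputation ignores: it widens the interval to reflect the uncertainty about the missing values themselves.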

Goals

Understand the challenges posed by missing values and the consequences of missing data on the validity of subsequent analyses
Have a snapshot of the missing values literature
Know how to use matrix completion methods
Be familiar with methods available to assess the credibility of the results obtained from incomplete data
Be able to explore and visualize heterogeneous data with missing values

Tutorial Outline

Concise introduction to the missing values theory (aims: imputation or estimation of the parameters? - missing values mechanism - EM algorithm).
Introduction to single imputation, with a focus on matrix completion methods based on iterative SVD such as nuclear-norm-regularized matrix approximation (Hastie, T., Mazumder, R., Lee, J., Zadeh, R., 2014). Comparison with imputation based on random forests (Stekhoven, D.J. & Bühlmann, P., 2011).
Presentation of classical multiple imputation methods for statistical analysis with missing data. It includes methods using an explicit joint distribution for the data and methods using conditional modeling (multivariate imputation by chained equations - MICE). Introduction to recent proposals based on principal components methods. These methods enable us to handle large data due to their dimensionality reduction property.
Introduction to strategies available to handle missing values in heterogeneous data (a mix of binary, categorical and continuous variables).
Exploration and visualization of incomplete data with principal component analysis and multiple correspondence analysis. Visualization of the variability due to the missing values.
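
The iterative SVD idea mentioned in the outline can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the implementation used by softImpute or missMDA: the function name `impute_svd` and the parameters `rank`, `lambda`, `max_iter` and `tol` are hypothetical. Missing cells start at the column means, then are repeatedly replaced by the values of a low-rank fit whose singular values are soft-thresholded by `lambda`.

```r
# Iterative (soft-thresholded) SVD imputation: a minimal sketch.
impute_svd <- function(X, rank = 2, lambda = 0, max_iter = 200, tol = 1e-8) {
  X <- as.matrix(X)
  miss <- is.na(X)
  stopifnot(rank <= min(dim(X)), all(colSums(!miss) > 0))
  # Initialise missing cells with the mean of the observed values in each column.
  col_means <- colMeans(X, na.rm = TRUE)
  filled <- X
  filled[miss] <- col_means[col(X)[miss]]
  for (iter in seq_len(max_iter)) {
    mu <- colMeans(filled)
    centered <- sweep(filled, 2, mu)
    s <- svd(centered, nu = rank, nv = rank)
    d <- pmax(s$d[seq_len(rank)] - lambda, 0)   # soft-threshold the singular values
    fitted <- s$u %*% (d * t(s$v))              # low-rank reconstruction
    fitted <- sweep(fitted, 2, mu, "+")
    new_filled <- X
    new_filled[miss] <- fitted[miss]            # only missing cells are updated
    change <- sum((new_filled[miss] - filled[miss])^2)
    filled <- new_filled
    if (change < tol) break
  }
  filled
}

# Small synthetic example: a nearly rank-one matrix with two missing cells.
set.seed(1)
truth <- outer(1:6, c(1, 2, 3)) + matrix(rnorm(18, sd = 0.01), 6, 3)
X <- truth
X[2, 1] <- NA
X[5, 3] <- NA
completed <- impute_svd(X, rank = 1)
```

Observed entries are never modified; only the missing cells change between iterations. Setting `lambda = 0` gives plain iterative SVD, while a positive `lambda` shrinks the fit, which is the kind of regularization the outline refers to.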

Through computer practicals using several R packages, participants will learn how to apply the statistical methods introduced in the course to realistic datasets from different fields (biomedical, social science, etc.).

Background Knowledge

Elementary knowledge of general statistical concepts and (linear) statistical models is assumed. Basic knowledge of singular value decomposition and principal component analysis could be useful.

For Monday

Please make sure to download the R packages missMDA, FactoMineR, VIM, missForest, Amelia and mice.
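
One possible way to do this from an R session is sketched below. It assumes a working internet connection and a configured CRAN mirror; the variable names `pkgs` and `missing_pkgs` are only illustrative.

```r
# Packages requested for the tutorial.
pkgs <- c("missMDA", "FactoMineR", "VIM", "missForest", "Amelia", "mice")

# Keep only those that are not installed yet.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install from CRAN when run interactively; skipped in non-interactive scripts.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```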

Additional References

missMDA: