Summary of data imputation techniques in R

ANDRES TOBAR
Jun 30, 2021 · 8 min read


Figure 1: Data imputation, taken from DBSCAN

Introduction.

Missing data are observations for which certain information on the variables in a data set is lacking. The presence of missing values usually raises doubts about the quality of the data set, increases the probability of committing Type I and Type II errors, reduces statistical power, and limits the reliability of confidence intervals (Streiner, 2002).

In practice, the behavior of missing data is most often explained with reference to surveys: people invited to participate who decline to respond can be treated as missing data, producing a poor response rate that prevents the analysis from proceeding. However, when participants omit responses to items within a data set, researchers have several statistical procedures available to replace, or impute, missing values with reasonable estimates (Van Buuren, 2018).

Various forms of data imputation have been developed, ranging from premises as simple as ignoring the missing values to sophisticated statistical and machine learning (ML) techniques. In general, no single imputation model is recommended; the choice depends on the application at hand and on avoiding statistical bias in the subsequent analysis.

Types of missing values.

The missingness mechanism is the cause of the existence of missing values. Most data imputation methods require that the occurrence of missing values can be explained as random, or as due to observed values in other variables that provide information about the missing data and the randomness of the missingness (Schafer and Graham, 2002).

  • Missing completely at random (MCAR): The existence of missing values is independent of the data. For example, when a random sample is taken from a population where each individual has the same probability of belonging to the sample, the missing data for the unsampled members are MCAR. In practice, this mechanism rarely occurs.
  • Missing at random (MAR): The probability of a value being missing is the same within groups defined by the observed data. For example, when sampling a population, the probability of being included may depend on some known property. This is the most common type of missing data and the mechanism most often assumed.
  • Missing not at random (MNAR): The probability of a value being missing depends on unknown reasons or on the unobserved value itself. A typical case is public-opinion research, where people with weak opinions respond less frequently, leaving blanks or empty answers.
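The three mechanisms above can be illustrated with a small simulation on hypothetical data: one variable y is made missing with a probability that is constant (MCAR), depends on an observed variable x (MAR), or depends on y itself (MNAR).

```r
# Toy illustration of the three missingness mechanisms (hypothetical data)
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# MCAR: every value has the same 20% chance of being missing
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: the chance of missingness depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(x), NA, y)

# MNAR: the chance of missingness depends on the unobserved value of y itself
y_mnar <- ifelse(runif(n) < plogis(y), NA, y)

# Proportion of missing values under each mechanism
mean(is.na(y_mcar)); mean(is.na(y_mar)); mean(is.na(y_mnar))
```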

Data imputation techniques.

Several ways of dealing with missing data have been proposed, from basic techniques to complex ones built on sophisticated statistical concepts. Some of these techniques are shown below.

Description of the data set and library integration.

The iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length, and petal width for 50 flowers from each of 3 iris species: Iris setosa, versicolor, and virginica.

install.packages("tidyverse")  #data manipulation
install.packages("missForest") #missing-value generation
install.packages("VIM")        #missing-value visualization
install.packages("mice")       #data imputation
install.packages("modeest")    #compute the mode
#Activate packages
library(tidyverse)
library(missForest)
library(mice)
library(VIM)
library(modeest)

Both the original database and the database with missing data are constructed.

do<-iris
summary(do)
Figure 2: Summary of data original (do).
df<- prodNA(do, noNA = 0.1)
summary(df)
Figure 3: Summary of data with missing values (df).

Calculate the percentage of missing data for columns and rows.

pperdidos <- function(x){round(sum(is.na(x))/length(x)*100,2)}
apply(df,2,pperdidos)
Figure 4: Percentage of missing values by column.
apply(df,1,pperdidos)
Figure 5: Percentage of missing values by row.

Complete case analysis.

Figure 6: Complete case example.

It consists of considering only the observations that have all their fields complete, i.e., rows with missing values are discarded. In practice, this technique is not recommended unless the proportion of missing values is very small (less than 5%). It can be used with any type of missing value, but it can generate statistical bias if the missingness mechanism isn't MCAR.

pos<-which(apply(df,1,pperdidos)>0)
cc<-df[-pos,]
dim(df)
Figure 7: Dimensions of data with missing values.
dim(cc)
Figure 8: Data dimensions considering complete cases.
apply(cc,2,pperdidos)
Figure 9: Percentage of missing values by column on complete-case data.

Do nothing.

It consists of letting the algorithm handle the missing data itself: it may have an imputation method built into its operation, it may omit the missing values, or it may raise an error because it cannot perform the required operation. Its advantage is that it can be used regardless of the type of missing data, but it can cause problems when working with algorithms and can introduce statistical bias.

#Let's select only the numerical data for explanatory purposes.
do2<-do %>% select(-Species)
df2<-df %>% select(-Species)
#Let's calculate different measures
#mean
apply(do2,2,mean)
Figure 9: Average of the columns in the original data.
apply(df2,2,mean)
Figure 10: Average of the columns without omission of missing values.
apply(df2,2,mean,na.rm=T)
Figure 11: Average of the columns with the omission of missing values.

Deductive Imputation.

It consists of deducing missing values from the available information; in general, its use is recommended when the amount of missing data is small. For example, a missing value in the age variable can be deduced when the date of birth is available. In small quantities it is a good tool; however, it is often not applicable.
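The birth-date example can be sketched as follows; the data frame, column names, and reference date are hypothetical, chosen only to illustrate the deduction.

```r
# Deductive imputation sketch (hypothetical data): when birth_date is
# known, a missing age can be deduced rather than estimated.
library(tidyverse)

people <- tibble(
  birth_date = as.Date(c("1990-05-10", "1985-11-23", "2000-01-02")),
  age        = c(31, NA, 21)   # one value missing
)

reference_date <- as.Date("2021-06-30")

# Deduce the missing age from the elapsed time since birth
people <- people %>%
  mutate(age = if_else(is.na(age),
                       floor(as.numeric(reference_date - birth_date) / 365.25),
                       age))
people
```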

Imputation of the mean/median/mode.

It consists of replacing missing values with the mean, median, or mode. It is the most commonly used technique due to its ease of implementation, but it is not recommended, since entering the same quantity for a large number of observations reduces the variability of the data. It should never be used with MNAR data.

df2$Sepal.Length[is.na(df2$Sepal.Length)]<-mean(df2$Sepal.Length,na.rm = T)
df2$Sepal.Width[is.na(df2$Sepal.Width)]<-median(df2$Sepal.Width,na.rm = T)
df2$Petal.Length[is.na(df2$Petal.Length)]<-mfv(df2$Petal.Length[!is.na(df2$Petal.Length)])[1] #keep a single mode
df2$Petal.Width[is.na(df2$Petal.Width)]<-mean(df2$Petal.Width,na.rm = T)
Figure 12: Histogram of original data vs. data with imputed values

Multiple imputations

Figure 13: Multiple imputation, taken from Towards Data Science

Multiple imputation consists of creating several different plausible imputed data sets and appropriately combining the results obtained from each of them. Imputed values are sampled from their predicted distribution given the observed data; multiple imputation is therefore based on a Bayesian approach. The imputation procedure must fully account for the uncertainty in predicting the missing values by injecting appropriate variability into the multiply imputed values, since we can never know the true values of the missing data.

The second step is to use standard statistical methods to fit the model of interest to each of the imputed data sets. The estimated associations will differ across the imputed data sets because of the variation introduced when imputing the missing values, and they are only useful when averaged together to give an overall estimated association. Standard errors are calculated using Rubin's (1976) rules, which take into account the variability in results between the imputed data sets and thus reflect the uncertainty associated with the missing values. Valid inferences are obtained because we are averaging over the distribution of the missing data given the observed data.
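The impute–fit–pool workflow described above can be sketched with mice's with() and pool() functions. The nhanes data set shipped with mice is used here as example data; the model formula bmi ~ age is an arbitrary illustrative choice.

```r
# Sketch of the full multiple-imputation workflow with mice: impute,
# fit the model of interest on each completed data set, then pool the
# estimates with Rubin's rules.
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # 5 imputed data sets
fit <- with(imp, lm(bmi ~ age))                          # fit on each data set
pooled <- pool(fit)                                      # combine with Rubin's rules
summary(pooled)                                          # pooled estimates and standard errors
```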

The R mice package has a set of techniques that allow us to impute missing values with plausible data values. These plausible values are drawn from a distribution designed specifically for each missing data point. To intuitively summarize the missing data findings, we will make use of the library named above together with the VIM library.

aggr(df, col=c('#1A5276','#7B241C'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Figure 14: Summary of missing values.

The methods implemented within the package are as follows:

methods(mice)
Figure 15: Available methods in the mice package

Predictive Mean Matching (PMM)

PMM involves selecting a data point from the original, non-missing data whose predicted value is close to the predicted value of the missing sample. The N closest values are chosen as candidates, from which one is chosen at random, as shown in the figure below:

Figure 16: PMM’s operation, taken from The MICE Algorithm
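The donor-selection step can be sketched by hand (this is a minimal illustration of the idea, not mice's actual implementation; all values are hypothetical):

```r
# Minimal hand-rolled sketch of the PMM idea: predict the missing entry,
# find the k observed cases with the closest predicted values, and copy
# one of their observed values at random.
set.seed(42)
obs_y    <- c(5.1, 4.9, 6.3, 5.8, 6.7)   # observed values
pred_obs <- c(5.0, 5.0, 6.2, 5.9, 6.5)   # model predictions for observed cases
pred_mis <- 6.0                          # model prediction for the missing case

k <- 3
donors  <- order(abs(pred_obs - pred_mis))[1:k]  # k closest candidates
imputed <- sample(obs_y[donors], 1)              # draw one donor value at random
imputed
```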
dimp<- mice(df,m=5,maxit=50,meth='pmm',seed=500)
  • m: number of imputed data sets
  • maxit: the maximum number of iterations.
  • meth: the method to be applied
summary(dimp)
Figure 17: Summary with the results of the algorithm.

To observe the imputed data:

dimp$imp$Sepal.Length
Figure 18: Imputed values for Sepal.Length observations
dimp$imp$Species
Figure 19: Imputed values for Species observations

To obtain a completed data set from the previous calculations, the complete() function is used, indicating the index of the imputed data set when you only want to work with one of them.
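A minimal sketch of complete(), shown on the nhanes data shipped with mice (the article's dimp object works the same way):

```r
# complete() turns a mice object into a single filled-in data frame; the
# second argument picks which of the m imputed data sets to return.
library(mice)

imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)
dc  <- complete(imp, 1)   # first imputed data set
anyNA(dc)                 # no missing values remain
```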

Imputed data analysis.

xyplot (dimp, Sepal.Length ~  Sepal.Width + Petal.Length+Petal.Width , pch = 18, cex = 1)
Figure 20: Scatter plot of numeric variables with known and imputed values
xyplot (dimp, Species ~  Sepal.Width + Petal.Length+Petal.Width+Sepal.Length , pch = 18, cex = 1)
Figure 21: Scatter plot Species vs the rest of the variables with known and imputed values

In both graphs (Figures 20 and 21) it can be seen that the imputed values (magenta) are consistent with the known values (blue); therefore, the imputation can be considered plausible. Another way of checking this is the density plot.

densityplot(dimp)
Figure 22: Density plots of numeric variables with known and imputed values.

For the imputations to be considered plausible, we want the distribution of the imputed data to be similar to that of the known data. Similarly, if we want to know which data set to select when using the complete() function, we can analyze which curves best fit the known data. Finally, the stripplot() function can be used to view the distributions of the variables as individual points.

stripplot (dimp, pch = 20, cex = 1.2)
Figure 23: Strip plot of numeric variables with known and imputed values.

Bibliography

  1. Van Buuren, S. 2018. “Flexible Imputation of Missing Data.” Chapman & Hall/CRC.
  2. Schafer, J. L., and J. W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7 (2): 147–77.
  3. Abayomi, K., A. Gelman, and M. Levy. 2008. “Diagnostics for Multivariate Imputations.” Journal of the Royal Statistical Society C 57 (3): 273–91.
  4. Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. https://doi.org/10.1093/biomet/63.3.581


ANDRES TOBAR

I am a graduate of the Escuela Politécnica Nacional, soon to obtain my degree in mathematical engineering.