Multiple Imputation

Required Packages and Datasets

In this lesson we will use the Amelia package and a subset of Beatrice Magistro’s dataset, with data from the European Social Survey.

# install.packages("Amelia")
library(Amelia)
library(ggplot2)
library(dplyr)
ess_sub <- read.csv("data/ess_sub.csv")

Each row in the dataset is an indivdual’s response to the survey and it has the following variables:

variable	description
`stfdem`	Satisfaction with democracy: {1-10 scale} (ordinal)
`year`	Year: {2002, 2003, …, 2012} (time variable)
`cntry`	Country: {DE, GB, …, NL} (cross section variable)
`crisis`	Crisis: {post, pre} (ordinal)
`age_gr`	Age group: {18-34, 35-64, +65} (ordinal)
`edulvla`	Education level: {low, medium, high} (ordinal)
`gndr`	Gender: {Men, Women} (categorical)
`peripherial`	Peripherial countries: {core, peri} (categorical)

Introduction

Often the datasets we use to test our theories and hypotheses have some, and sometimes numerous, missing values (NA). What do we do?

One option would be to drop the rows that have a missing value for one of our covariates and/or the variable of interest. For example, when we estimate a linear model, R automatically gets rid of the rows with missing values for the variables in the model (“listwise deletion”). In this case 7,344 observations.

model <- lm(stfdem ~ crisis + age_gr + edulvla + gndr + peripherial,
            data = ess_sub)
summary(model)

## 
## Call:
## lm(formula = stfdem ~ crisis + age_gr + edulvla + gndr + peripherial, 
##     data = ess_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0930 -1.6938  0.1299  1.8842  5.6123 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.87012    0.01879 312.469  < 2e-16 ***
## crisispre        0.22285    0.01177  18.931  < 2e-16 ***
## age_gr18-34     -0.10679    0.01754  -6.087 1.15e-09 ***
## age_gr35-64     -0.20238    0.01500 -13.488  < 2e-16 ***
## edulvlalow      -0.71168    0.01522 -46.752  < 2e-16 ***
## edulvlamedium   -0.62842    0.01382 -45.470  < 2e-16 ***
## gndrWomen       -0.24200    0.01134 -21.349  < 2e-16 ***
## peripherialperi -0.32635    0.01316 -24.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.414 on 182566 degrees of freedom
##   (7344 observations deleted due to missingness)
## Multiple R-squared:  0.02418,    Adjusted R-squared:  0.02414 
## F-statistic: 646.1 on 7 and 182566 DF,  p-value: < 2.2e-16

However, despite not having a value for a particular variable or set of variables, thoses row may provide information about the other variables. Moreover, dropping those observation may result in biased and inefficient estimates.

The other option is to impute (“fill in”) the missing values. In this lesson we will see one method and R package to do so: Amelia. This method assumes that the data:

follows a multivariate normal distribution
is missing at random -MAR- (which means that the missingness depends only on the observed data)

Then the Amelia uses a bootstrap-EM algorithm (EMB) to estimate/impute the missing values.

Exploring missing values

The summary() function provides you with information about the number of missing values per variable. This dataset has missing values for the outcome variable (stfdem) and the covariates edulvla and gndr.

summary(ess_sub)

##      stfdem            year          cntry         crisis      
##  Min.   : 0.000   Min.   :2002   DE     : 16145   post: 68948  
##  1st Qu.: 4.000   1st Qu.:2004   GB     : 12521   pre :120970  
##  Median : 5.000   Median :2008   IE     : 12324                
##  Mean   : 5.189   Mean   :2007   PT     : 11774                
##  3rd Qu.: 7.000   3rd Qu.:2010   FI     : 11260                
##  Max.   :10.000   Max.   :2012   NL     : 11048                
##  NA's   :6405                    (Other):114846                
##    age_gr         edulvla         gndr        peripherial  
##  +65  : 40674   high  :55013   Men  : 88053   core:137997  
##  18-34: 46887   low   :61545   Women:101752   peri: 51921  
##  35-64:102357   medium:72443   NA's :   113                
##                 NA's  :  917                               
##                                                            
##                                                            
##

The Amelia package has a function that helps visualizing the missing data in a dataset: missmap(). Warning: takes few minutes to run, depending on the size of the dataset.

missmap(ess_sub)

Imputation

The amelia() function takes the following parameters:

x: the dataset (e.g. ess_sub)
m: number of imputed datasets to create (e.g. 5, see Fox, p.564)
logs: a vector with variables that are log transformations
logstc: a vector with variables that are logistic transformations
noms: a vector with variables that are nominal (e.g. gnder, peripherial)
ords: a vector with variables that are ordinal (e.g. stfdem)
ts: name of the variable indicating time (for time series data) (e.g. year)
cs: name of the cross section variable (for cross section data) (e.g. cntry)
idvars: name of a variablde indicating ID, so Amelia can ignore it

The amelia() function will create m (so 5) new datasets with imputed values for the all missing values iness_sub.

m <- 5
amelia_output <- amelia(ess_sub, m = m, ts = "year" , cs = "cntry",
                      ords = c("stfdem", "crisis", "age_gr","edulvla"),
                      noms = c("gndr", "peripherial"))

## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##   1  2
## 
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2

The imputed datasets are in the imputations list within the amelia_output. You can combine them into one dataset using the following loop:

ess_all <- NULL
for (i in 1:length(amelia_output$imputations)) {
  imp <- as.data.frame(amelia_output$imputations[i])
  colnames(imp) <- colnames(ess_sub)
  imp$imp <- paste0("imp", i)
  ess_all <- rbind(ess_all, imp)
}

You can also access a matrix indicating the original missing values in the amelia_output. This will be useful to compare differences between actual and imputed data and to make judgements about the quality/plausibility of the imputed data.

missMatrix <- amelia_output$missMatrix

Diagnostics

For the variables with missing values (e.g. stfdem, edulvla), explore differences across imputed datasets.

ggplot(ess_all %>%
         group_by(imp,stfdem)%>%
         summarize(n = n()), aes(y = n, x = factor(stfdem))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = 0.5, y = 1000, hjust = 0,position = ,
            angle = 90, color = "white") +
  facet_wrap(~ imp)

ggplot(ess_all %>%
         group_by(imp,edulvla)%>%
         summarize(n = n()), aes(y = n, x = factor(edulvla))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = 0.5, y = 20000, hjust = 0.5,position = ,
            color = "white") +
  facet_wrap(~ imp)

Check Jeff Arnold’s lesson on Multiple Imputation to see other ways of evaluating the plausibility of the imputated data.

Combining imputated datasets in your analysis

See Fox p.656 for details about how to combine the coefficients from different imputated datasets.

The Amelia package has a function to do so: mi.meld(). The function takes 2 paramters:

q: a dataset/matrix with k rows (k = number of imputated datasets) and v columns (v = number of covariates in the model). This dataset contains the coefficients for all covariates across imputated datasets.
se: a dataset/matrix with k rows (k = number of imputated datasets) and v columns (v = number of covariates in the model). This dataset contains the standard errors for all covariates across imputated datasets.

Calculate q and se

q <- NULL
se <- NULL
form <- formula(stfdem ~ + crisis + age_gr + edulvla + gndr + peripherial)
for (i in 1:m) {
  model <- lm(form, data = amelia_output$imputations[[i]])
  q <- rbind(q, coef(model))
  se <- rbind(se, coef(summary(model))[,2])
}
q

##      (Intercept) crisispre age_gr18-34 age_gr35-64 edulvlalow
## [1,]    5.874327 0.2180225 -0.11535668  -0.2099701 -0.6956348
## [2,]    5.867108 0.2221274 -0.11024628  -0.2062861 -0.6938102
## [3,]    5.867050 0.2222165 -0.09941342  -0.2009758 -0.6930619
## [4,]    5.868573 0.2184105 -0.10471958  -0.2029573 -0.6904310
## [5,]    5.864418 0.2221892 -0.10514010  -0.1998890 -0.6960684
##      edulvlamedium  gndrWomen peripherialperi
## [1,]    -0.6297112 -0.2389228      -0.3270229
## [2,]    -0.6276297 -0.2380649      -0.3295379
## [3,]    -0.6298683 -0.2419523      -0.3363500
## [4,]    -0.6307592 -0.2403058      -0.3271767
## [5,]    -0.6299122 -0.2383574      -0.3305466

se

##      (Intercept)  crisispre age_gr18-34 age_gr35-64 edulvlalow
## [1,]  0.01855185 0.01160933  0.01719708  0.01468296 0.01498298
## [2,]  0.01855029 0.01161053  0.01719835  0.01468398 0.01498446
## [3,]  0.01854366 0.01160459  0.01719172  0.01467780 0.01497655
## [4,]  0.01854561 0.01160602  0.01719158  0.01467775 0.01498019
## [5,]  0.01855177 0.01160988  0.01719767  0.01468360 0.01498452
##      edulvlamedium  gndrWomen peripherialperi
## [1,]    0.01372982 0.01118389      0.01293630
## [2,]    0.01372948 0.01118494      0.01293854
## [3,]    0.01372459 0.01117896      0.01292999
## [4,]    0.01372468 0.01118066      0.01293270
## [5,]    0.01372955 0.01118407      0.01293638

cb_model <- mi.meld(q = q, se = se)
cb_model_df <- data.frame(varname = colnames(cb_model[[1]]),
                          coef = as.numeric(cb_model[[1]]),
                          se = as.numeric(cb_model[[2]]),
                          coef.lwr = as.numeric(cb_model[[1]] - (2 * as.numeric(cb_model[[2]]))),
                          coef.upr = as.numeric(cb_model[[1]] + (2 * as.numeric(cb_model[[2]]))))
cb_model_df

##           varname       coef         se   coef.lwr    coef.upr
## 1     (Intercept)  5.8682952 0.01898387  5.8303274  5.90626290
## 2       crisispre  0.2205932 0.01184989  0.1968934  0.24429300
## 3     age_gr18-34 -0.1069752 0.01842968 -0.1438346 -0.07011585
## 4     age_gr35-64 -0.2040157 0.01536028 -0.2347362 -0.17329510
## 5      edulvlalow -0.6938013 0.01518468 -0.7241706 -0.66343191
## 6   edulvlamedium -0.6295761 0.01378656 -0.6571492 -0.60200299
## 7       gndrWomen -0.2395207 0.01132061 -0.2621619 -0.21687945
## 8 peripherialperi -0.3301268 0.01358620 -0.3572992 -0.30295442