In this lesson we will use the Amelia
package and a subset of Beatrice Magistro’s dataset, with data from the European Social Survey
.
# install.packages("Amelia")
library(Amelia)
library(ggplot2)
library(dplyr)
ess_sub <- read.csv("data/ess_sub.csv")
Each row in the dataset is an indivdual’s response to the survey and it has the following variables:
variable | description |
---|---|
stfdem |
Satisfaction with democracy: {1-10 scale} (ordinal) |
year |
Year: {2002, 2003, …, 2012} (time variable) |
cntry |
Country: {DE, GB, …, NL} (cross section variable) |
crisis |
Crisis: {post, pre} (ordinal) |
age_gr |
Age group: {18-34, 35-64, +65} (ordinal) |
edulvla |
Education level: {low, medium, high} (ordinal) |
gndr |
Gender: {Men, Women} (categorical) |
peripherial |
Peripherial countries: {core, peri} (categorical) |
Often the datasets we use to test our theories and hypotheses have some, and sometimes numerous, missing values (NA
). What do we do?
One option would be to drop the rows that have a missing value for one of our covariates and/or the variable of interest. For example, when we estimate a linear model, R
automatically gets rid of the rows with missing values for the variables in the model (“listwise deletion”). In this case 7,344 observations.
model <- lm(stfdem ~ crisis + age_gr + edulvla + gndr + peripherial,
data = ess_sub)
summary(model)
##
## Call:
## lm(formula = stfdem ~ crisis + age_gr + edulvla + gndr + peripherial,
## data = ess_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0930 -1.6938 0.1299 1.8842 5.6123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87012 0.01879 312.469 < 2e-16 ***
## crisispre 0.22285 0.01177 18.931 < 2e-16 ***
## age_gr18-34 -0.10679 0.01754 -6.087 1.15e-09 ***
## age_gr35-64 -0.20238 0.01500 -13.488 < 2e-16 ***
## edulvlalow -0.71168 0.01522 -46.752 < 2e-16 ***
## edulvlamedium -0.62842 0.01382 -45.470 < 2e-16 ***
## gndrWomen -0.24200 0.01134 -21.349 < 2e-16 ***
## peripherialperi -0.32635 0.01316 -24.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.414 on 182566 degrees of freedom
## (7344 observations deleted due to missingness)
## Multiple R-squared: 0.02418, Adjusted R-squared: 0.02414
## F-statistic: 646.1 on 7 and 182566 DF, p-value: < 2.2e-16
However, despite not having a value for a particular variable or set of variables, thoses row may provide information about the other variables. Moreover, dropping those observation may result in biased and inefficient estimates.
The other option is to impute (“fill in”) the missing values. In this lesson we will see one method and R package
to do so: Amelia
. This method assumes that the data:
Then the Amelia
uses a bootstrap-EM algorithm (EMB) to estimate/impute the missing values.
The summary()
function provides you with information about the number of missing values per variable. This dataset has missing values for the outcome variable (stfdem
) and the covariates edulvla
and gndr
.
summary(ess_sub)
## stfdem year cntry crisis
## Min. : 0.000 Min. :2002 DE : 16145 post: 68948
## 1st Qu.: 4.000 1st Qu.:2004 GB : 12521 pre :120970
## Median : 5.000 Median :2008 IE : 12324
## Mean : 5.189 Mean :2007 PT : 11774
## 3rd Qu.: 7.000 3rd Qu.:2010 FI : 11260
## Max. :10.000 Max. :2012 NL : 11048
## NA's :6405 (Other):114846
## age_gr edulvla gndr peripherial
## +65 : 40674 high :55013 Men : 88053 core:137997
## 18-34: 46887 low :61545 Women:101752 peri: 51921
## 35-64:102357 medium:72443 NA's : 113
## NA's : 917
##
##
##
The Amelia
package has a function that helps visualizing the missing data in a dataset: missmap()
. Warning: takes few minutes to run, depending on the size of the dataset.
missmap(ess_sub)
The amelia()
function takes the following parameters:
x
: the dataset (e.g. ess_sub)m
: number of imputed datasets to create (e.g. 5, see Fox, p.564)logs
: a vector with variables that are log
transformationslogstc
: a vector with variables that are logistic
transformationsnoms
: a vector with variables that are nominal (e.g. gnder
, peripherial
)ords
: a vector with variables that are ordinal (e.g. stfdem
)ts
: name of the variable indicating time (for time series data) (e.g. year
)cs
: name of the cross section variable (for cross section data) (e.g. cntry
)idvars
: name of a variablde indicating ID, so Amelia
can ignore itThe amelia()
function will create m
(so 5) new datasets with imputed values for the all missing values iness_sub
.
m <- 5
amelia_output <- amelia(ess_sub, m = m, ts = "year" , cs = "cntry",
ords = c("stfdem", "crisis", "age_gr","edulvla"),
noms = c("gndr", "peripherial"))
## -- Imputation 1 --
##
## 1 2
##
## -- Imputation 2 --
##
## 1 2
##
## -- Imputation 3 --
##
## 1 2
##
## -- Imputation 4 --
##
## 1 2
##
## -- Imputation 5 --
##
## 1 2
The imputed datasets are in the imputations
list within the amelia_output
. You can combine them into one dataset using the following loop:
ess_all <- NULL
for (i in 1:length(amelia_output$imputations)) {
imp <- as.data.frame(amelia_output$imputations[i])
colnames(imp) <- colnames(ess_sub)
imp$imp <- paste0("imp", i)
ess_all <- rbind(ess_all, imp)
}
You can also access a matrix indicating the original missing values in the amelia_output
. This will be useful to compare differences between actual and imputed data and to make judgements about the quality/plausibility of the imputed data.
missMatrix <- amelia_output$missMatrix
For the variables with missing values (e.g. stfdem
, edulvla
), explore differences across imputed datasets.
ggplot(ess_all %>%
group_by(imp,stfdem)%>%
summarize(n = n()), aes(y = n, x = factor(stfdem))) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = 0.5, y = 1000, hjust = 0,position = ,
angle = 90, color = "white") +
facet_wrap(~ imp)
ggplot(ess_all %>%
group_by(imp,edulvla)%>%
summarize(n = n()), aes(y = n, x = factor(edulvla))) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = 0.5, y = 20000, hjust = 0.5,position = ,
color = "white") +
facet_wrap(~ imp)
Check Jeff Arnold’s lesson on Multiple Imputation to see other ways of evaluating the plausibility of the imputated data.
See Fox p.656 for details about how to combine the coefficients from different imputated datasets.
The Amelia
package has a function to do so: mi.meld()
. The function takes 2 paramters:
q
: a dataset/matrix with k rows (k = number of imputated datasets) and v columns (v = number of covariates in the model). This dataset contains the coefficients for all covariates across imputated datasets.se
: a dataset/matrix with k rows (k = number of imputated datasets) and v columns (v = number of covariates in the model). This dataset contains the standard errors for all covariates across imputated datasets.Calculate q
and se
q <- NULL
se <- NULL
form <- formula(stfdem ~ + crisis + age_gr + edulvla + gndr + peripherial)
for (i in 1:m) {
model <- lm(form, data = amelia_output$imputations[[i]])
q <- rbind(q, coef(model))
se <- rbind(se, coef(summary(model))[,2])
}
q
## (Intercept) crisispre age_gr18-34 age_gr35-64 edulvlalow
## [1,] 5.874327 0.2180225 -0.11535668 -0.2099701 -0.6956348
## [2,] 5.867108 0.2221274 -0.11024628 -0.2062861 -0.6938102
## [3,] 5.867050 0.2222165 -0.09941342 -0.2009758 -0.6930619
## [4,] 5.868573 0.2184105 -0.10471958 -0.2029573 -0.6904310
## [5,] 5.864418 0.2221892 -0.10514010 -0.1998890 -0.6960684
## edulvlamedium gndrWomen peripherialperi
## [1,] -0.6297112 -0.2389228 -0.3270229
## [2,] -0.6276297 -0.2380649 -0.3295379
## [3,] -0.6298683 -0.2419523 -0.3363500
## [4,] -0.6307592 -0.2403058 -0.3271767
## [5,] -0.6299122 -0.2383574 -0.3305466
se
## (Intercept) crisispre age_gr18-34 age_gr35-64 edulvlalow
## [1,] 0.01855185 0.01160933 0.01719708 0.01468296 0.01498298
## [2,] 0.01855029 0.01161053 0.01719835 0.01468398 0.01498446
## [3,] 0.01854366 0.01160459 0.01719172 0.01467780 0.01497655
## [4,] 0.01854561 0.01160602 0.01719158 0.01467775 0.01498019
## [5,] 0.01855177 0.01160988 0.01719767 0.01468360 0.01498452
## edulvlamedium gndrWomen peripherialperi
## [1,] 0.01372982 0.01118389 0.01293630
## [2,] 0.01372948 0.01118494 0.01293854
## [3,] 0.01372459 0.01117896 0.01292999
## [4,] 0.01372468 0.01118066 0.01293270
## [5,] 0.01372955 0.01118407 0.01293638
cb_model <- mi.meld(q = q, se = se)
cb_model_df <- data.frame(varname = colnames(cb_model[[1]]),
coef = as.numeric(cb_model[[1]]),
se = as.numeric(cb_model[[2]]),
coef.lwr = as.numeric(cb_model[[1]] - (2 * as.numeric(cb_model[[2]]))),
coef.upr = as.numeric(cb_model[[1]] + (2 * as.numeric(cb_model[[2]]))))
cb_model_df
## varname coef se coef.lwr coef.upr
## 1 (Intercept) 5.8682952 0.01898387 5.8303274 5.90626290
## 2 crisispre 0.2205932 0.01184989 0.1968934 0.24429300
## 3 age_gr18-34 -0.1069752 0.01842968 -0.1438346 -0.07011585
## 4 age_gr35-64 -0.2040157 0.01536028 -0.2347362 -0.17329510
## 5 edulvlalow -0.6938013 0.01518468 -0.7241706 -0.66343191
## 6 edulvlamedium -0.6295761 0.01378656 -0.6571492 -0.60200299
## 7 gndrWomen -0.2395207 0.01132061 -0.2621619 -0.21687945
## 8 peripherialperi -0.3301268 0.01358620 -0.3572992 -0.30295442