library("car")
library("dplyr")
library("broom")
library("boot")

This example uses data from the Barro and Lee dataset to analyze life expectancy. It is a pedagogical example and does not represent a sophisticated epidemiological model. The data cover 138 countries in 1985.

barro_raw <- read.csv("http://pols503.github.io/pols_503_sp15/data/barro.csv")

Create a subset of the barro data with only the variables we need, dropping observations with missing values.

barro <- barro_raw %>%
  select(lifeexp, school, gdpcap85,
         civlib5, wartime) %>%
  na.omit()

glimpse(barro)
## Observations: 95
## Variables:
## $ lifeexp  (dbl) 61.5, 49.4, 58.3, 55.1, 49.1, 57.7, 42.4, 53.2, 38.7,...
## $ school   (dbl) 2.394, 0.698, 3.700, 2.230, 1.261, 3.143, 0.837, 3.21...
## $ gdpcap85 (int) 2951, 1067, 2252, 1432, 596, 2540, 616, 759, 616, 772...
## $ civlib5  (dbl) 6.0, 6.2, 3.0, 6.2, 5.4, 6.2, 3.4, 4.4, 6.0, 4.6, 5.0...
## $ wartime  (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,...

The variables are:

  • gdpcap85: Per capita GDP in 1985, in international dollars
  • school: Average years of education
  • civlib5: Index of civil liberties, on a scale from 1 (low) to 7 (high)
  • wartime: Percent of recent history spent at war

We will compare two models of life expectancy.

Model 1 regresses life expectancy on gdpcap85, school, civlib5, and wartime.

mod1 <- lm(lifeexp ~ gdpcap85 + school + civlib5 + wartime,
           data = barro)
mod1
## 
## Call:
## lm(formula = lifeexp ~ gdpcap85 + school + civlib5 + wartime, 
##     data = barro)
## 
## Coefficients:
## (Intercept)     gdpcap85       school      civlib5      wartime  
##  52.9671697    0.0003216    2.3866392   -0.7522863    1.0936902

Model 2 regresses life expectancy on the logarithm of gdpcap85, the logarithm of school, civlib5, and wartime.

mod2 <- lm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime,
           data = barro)
mod2
## 
## Call:
## lm(formula = lifeexp ~ log(gdpcap85) + log(school) + civlib5 + 
##     wartime, data = barro)
## 
## Coefficients:
##   (Intercept)  log(gdpcap85)    log(school)        civlib5        wartime  
##       13.1893         5.3730         6.0431        -0.1843        -0.2174
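
Since broom is loaded, tidy() will also return the coefficient table of either model as a data frame, which is often easier to work with than the printed output:

tidy(mod2)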

Comparing Models

We can compare the models from the previous section using \(R^2\), adjusted \(R^2\), the standard error of the regression \(\hat\sigma_{\epsilon}\), AIC, and BIC.

The summary of the lm object shows the \(R^2\) values and the standard error of the regression (the “Residual standard error”):

mod1_summary <- summary(mod1)
mod1_summary
## 
## Call:
## lm(formula = lifeexp ~ gdpcap85 + school + civlib5 + wartime, 
##     data = barro)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2871  -3.0705  -0.0368   4.6629  12.1285 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.9671697  3.0379356  17.435  < 2e-16 ***
## gdpcap85     0.0003216  0.0002551   1.260    0.211    
## school       2.3866392  0.4102266   5.818 9.01e-08 ***
## civlib5     -0.7522863  0.4861266  -1.548    0.125    
## wartime      1.0936902  5.5354286   0.198    0.844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.43 on 90 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7399 
## F-statistic: 67.84 on 4 and 90 DF,  p-value: < 2.2e-16

These can be extracted from the object returned by summary. Explore the object to find the elements corresponding to those values:

str(mod1_summary)
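
For example, the \(R^2\), the adjusted \(R^2\), and the standard error of the regression are stored as the following elements:

mod1_summary$r.squared       # Multiple R-squared
mod1_summary$adj.r.squared   # Adjusted R-squared
mod1_summary$sigma           # Residual standard error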

The functions AIC and BIC do what you would expect them to do:

AIC(mod1)
## [1] 597.9328
BIC(mod1)
## [1] 613.256
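
As a sketch of what these functions compute, both can be reproduced from the log-likelihood of the fitted model; the degrees of freedom count the regression coefficients plus the error variance:

ll <- logLik(mod1)
k <- attr(ll, "df")                        # coefficients plus the error variance
-2 * as.numeric(ll) + 2 * k                # AIC
-2 * as.numeric(ll) + log(nobs(mod1)) * k  # BIC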

The glance function from the broom package returns all of these in a one-row data frame:

glance(mod1)
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1  0.750932     0.7398623 5.430069  67.83678 2.375664e-26  5 -292.9664
##        AIC     BIC deviance df.residual
## 1 597.9328 613.256 2653.708          90

Now let’s calculate these values for Model 2:

glance(mod2)
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.8888772     0.8839384 3.627003  179.9787 4.719766e-42  5 -254.6296
##        AIC      BIC deviance df.residual
## 1 521.2592 536.5825 1183.964          90
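
Because glance() returns a data frame, the fit statistics for both models can be stacked for a side-by-side comparison; one way to do this with dplyr is:

bind_rows(mutate(glance(mod1), model = "Model 1"),
          mutate(glance(mod2), model = "Model 2")) %>%
  select(model, r.squared, adj.r.squared, sigma, AIC, BIC)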

By any of those measures, Model 2 is the better-fitting model: it has a higher \(R^2\) and adjusted \(R^2\), a smaller standard error of the regression, and lower AIC and BIC.

Cross Validation

The boot package contains functions to do cross validation.

Here is an example of cross-validation using the previous models. The model is refit with glm rather than lm; with the default Gaussian family, glm produces the same estimates as lm, but it returns an object that the cv.glm function can work with:

mod1_glm <- glm(lifeexp ~ gdpcap85 + school + civlib5 + wartime, data = barro)
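
As a quick check, the coefficient estimates from the glm and lm fits should agree (up to numerical tolerance):

all.equal(coef(mod1), coef(mod1_glm))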

This runs 10-fold cross-validation:

cv_err_K10 <- cv.glm(barro, mod1_glm, K=10)
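
Note that cv.glm assigns observations to folds at random, so K-fold results will differ slightly from run to run. To make them reproducible, set a seed before calling cv.glm; the value used here is arbitrary:

set.seed(2015)  # arbitrary seed, chosen only for reproducibility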

The object returned by cv.glm includes the original function call, the value of K, the prediction error delta, and the random number seed. For now, all we care about is the prediction error. By default, cv.glm measures prediction error by the mean squared error. delta has two elements: the first is the average of the mean squared errors from the folds; the second adjusts for the bias that occurs from using K folds rather than leave-one-out cross-validation.

cv_err_K10$delta
## [1] 31.50753 31.31174
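
Since delta is a mean squared error, its square root puts the prediction error back on the scale of the outcome, years of life expectancy:

sqrt(cv_err_K10$delta)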

5-fold cross-validation

cv_err_K5 <- cv.glm(barro, mod1_glm, K=5)
cv_err_K5$delta
## [1] 30.68531 30.37128

And leave-one-out cross-validation (no K argument is given):

mod_cv_loo <- cv.glm(barro, mod1_glm)
mod_cv_loo$delta
## [1] 31.46432 31.44491

The interpretation in these cases is that the estimated prediction error, a mean squared error, is a little over 31, which corresponds to a root mean squared prediction error of roughly 5.6 years of life expectancy. We would like to do better than that.

We can calculate the same for Model 2:

mod2_glm <- glm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime, data = barro)

cv.glm(barro, mod2_glm, K=10)$delta
## [1] 13.49913 13.44212
cv.glm(barro, mod2_glm, K=5)$delta
## [1] 13.89080 13.72502
cv.glm(barro, mod2_glm)$delta
## [1] 13.69680 13.69006

In these cases the estimated prediction error is around 13 to 14, a root mean squared prediction error of about 3.7 years. Maybe not as good as we would like, but a clear improvement over Model 1.

We could also try adding a squared term for wartime to see whether that improves the model. We’ll use 10-fold cross-validation to evaluate it.

mod3 <- glm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime + I(wartime ^ 2), data = barro)
cv.glm(barro, mod3, K = 10)$delta
## [1] 12.59070 12.52253

This shows a small further reduction in the estimated prediction error (a root mean squared error of roughly 3.5 years rather than 3.7). Adding squared wartime seems to improve the model, but it is not nearly as important as getting the functional form of gdpcap85 and school correct.

Predictive Comparisons by Simulation and Bootstrap

Simulating Coefficients from the Asymptotic Normal Distribution

The first step is to draw values of \(\beta\) from a multivariate normal distribution centered at \(\hat\beta\) with covariance matrix \(\hat{V}(\hat\beta)\): \[ \tilde\beta \sim N(\hat\beta, \hat{V}(\hat\beta)) \] This relies on the central limit theorem result that \(\hat\beta\) is approximately multivariate normal as \(n \to \infty\). We randomly sample simulated values of \(\beta\) from that distribution:

n <- 1024
simbetas <- MASS::mvrnorm(n, coef(mod2), vcov(mod2))
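
As a quick sanity check, the column means of the simulated draws should be close to the estimated coefficients:

colMeans(simbetas)
coef(mod2)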

For log(gdpcap85), we compare predicted life expectancy at the mean of log(gdpcap85) versus one standard deviation above that mean, holding the other variables at typical values. First construct the two scenario data frames:

barro_low <- summarize(na.omit(barro),
                      gdpcap85 = exp(mean(log(gdpcap85))),
                      school = exp(mean(log(school))),
                      civlib5 = mean(civlib5),
                      wartime = mean(wartime))

barro_high <- summarize(na.omit(barro),
                       gdpcap85 = exp(mean(log(gdpcap85)) + sd(log(gdpcap85))),
                       school = exp(mean(log(school))),
                       civlib5 = mean(civlib5),
                       wartime = mean(wartime))

Construct the design matrices for the two scenarios and calculate the difference in predicted life expectancy for each simulated coefficient vector; the spread of these differences gives a standard error and confidence interval for the difference (summarized after the loop below):

xlow <- model.matrix(~ log(gdpcap85) + log(school)
                      + civlib5 + wartime, data = barro_low)
xhigh <- model.matrix(~ log(gdpcap85) + log(school)
                      + civlib5 + wartime, data = barro_high)
diffs <- rep(NA, nrow(simbetas))
for (i in 1:nrow(simbetas)) {
  betas <- simbetas[i, ]
  # Difference in predicted life expectancy between the high and low scenarios
  diffs[i] <- xhigh %*% betas - xlow %*% betas
}
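
The simulated differences can then be summarized into a point estimate, a standard error, and a confidence interval, for example:

mean(diffs)                       # expected difference in predicted life expectancy
sd(diffs)                         # simulation-based standard error of the difference
quantile(diffs, c(0.025, 0.975))  # 95% confidence interval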

Two packages implement this method of simulation for predictive comparisons:

  • Zelig
  • simcf: Chris Adolph’s package written while teaching this course.

Bootstrapping

The function Boot from car can be used to bootstrap coefficients.

Boot(mod2)
## 
## CASE RESAMPLING BOOTSTRAP FOR CENSORED DATA
## 
## 
## Call:
## boot::boot(data = data.frame(update(object, model = TRUE)$model), 
##     statistic = boot.f, R = R, .fn = f)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 13.1893476 -5.806878e-02   5.4902883
## t2*  5.3730192  4.712968e-03   0.6705320
## t3*  6.0430645  1.270322e-02   0.8177946
## t4* -0.1842516  3.011728e-05   0.2824878
## t5* -0.2173762  5.102515e-01   3.4657527
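
The object returned by Boot can also be saved and summarized with the methods car provides for bootstrap objects; a sketch of how that might look, using percentile intervals, is:

mod2_boot <- Boot(mod2)
# percentile bootstrap confidence intervals for the coefficients
confint(mod2_boot, type = "perc")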

The boot package has more powerful and general functions.

Bootstrapping can be done manually with dplyr functions:

simulations <- list()
for (i in 1:1024) {
  # Randomly resample data
  .data <- sample_frac(barro, replace = TRUE)
  # Run a new regression
  mod <- lm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime, data = .data)
  # Predict
  lower <- predict(mod, newdata = barro_low)
  higher <- predict(mod, newdata = barro_high)
  simulations[[i]] <-
    data.frame(diff = higher - lower)
}
sims <- bind_rows(simulations)

From these simulations, one way to calculate a 95% confidence interval is to take the 2.5th and 97.5th percentiles:

quantile(sims$diff, c(0.025, 0.975))
##     2.5%    97.5% 
## 4.271793 7.087771

This code could have been streamlined using the bootstrap function in the broom package.
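
For reference, the version of broom available when this was written provided a bootstrap function meant to be combined with dplyr’s do; a rough sketch of that pattern (treat it as illustrative, since bootstrap has since been superseded by the rsample package) is:

boot_sims <- barro %>%
  bootstrap(1024) %>%
  do({
    mod <- lm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime, data = .)
    data.frame(diff = predict(mod, newdata = barro_high) -
                 predict(mod, newdata = barro_low))
  })
quantile(boot_sims$diff, c(0.025, 0.975))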


Example adapted from Christopher Adolph (Spring 2014) “Linear Regression: Specification and Fitting” [Lecture slides]. http://faculty.washington.edu/cadolph/503/topic5.pw.pdf.