library("car")
library("dplyr")
library("broom")
library("boot")
This example will use data from the Barro and Lee dataset to analyze life expectancy. This is a pedagogical example, and does not represent a sophisticated epidemiological model. Data are 138 countries in 1985.
barro_raw <- read.csv("http://pols503.github.io/pols_503_sp15/data/barro.csv")
Subset of barro data with only the variables we need, and dropping missing observations.
barro <- barro_raw %>%
select(lifeexp, school, gdpcap85,
civlib5, wartime) %>%
na.omit()
glimpse(barro)
## Observations: 95
## Variables:
## $ lifeexp (dbl) 61.5, 49.4, 58.3, 55.1, 49.1, 57.7, 42.4, 53.2, 38.7,...
## $ school (dbl) 2.394, 0.698, 3.700, 2.230, 1.261, 3.143, 0.837, 3.21...
## $ gdpcap85 (int) 2951, 1067, 2252, 1432, 596, 2540, 616, 759, 616, 772...
## $ civlib5 (dbl) 6.0, 6.2, 3.0, 6.2, 5.4, 6.2, 3.4, 4.4, 6.0, 4.6, 5.0...
## $ wartime (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,...
Per capita GDP in 1985, thousands of international dollars
Average years of education
low = 1 to high = 7 scale of civil liberties
Percent of recent history spent at war
We will compare two models of life expectancy.
Model 1 regresses life expectancy on gdpcap85
, school
, civlib5
, and wartime
.
mod1 <- lm(lifeexp ~ gdpcap85 + school + civlib5 + wartime,
data = barro)
mod1
##
## Call:
## lm(formula = lifeexp ~ gdpcap85 + school + civlib5 + wartime,
## data = barro)
##
## Coefficients:
## (Intercept) gdpcap85 school civlib5 wartime
## 52.9671697 0.0003216 2.3866392 -0.7522863 1.0936902
Model 2 regresses life expectancy on the logarithm of gdpcap85
, the logarithm of school
, civlib5
, and wartime
.
mod2 <- lm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime,
data = barro)
mod2
##
## Call:
## lm(formula = lifeexp ~ log(gdpcap85) + log(school) + civlib5 +
## wartime, data = barro)
##
## Coefficients:
## (Intercept) log(gdpcap85) log(school) civlib5 wartime
## 13.1893 5.3730 6.0431 -0.1843 -0.2174
We can compare the models from the previous section using \(R^2\), adjusted-\(R^2\), the standard error of the regression \(\hat\sigma^2_{\epsilon}\), AIC, and BIC.
The summary of the lm object shows the R squared values and the standard deviation of the regression (“Residual standard error”):
mod1_summary <- summary(mod1)
mod1_summary
##
## Call:
## lm(formula = lifeexp ~ gdpcap85 + school + civlib5 + wartime,
## data = barro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2871 -3.0705 -0.0368 4.6629 12.1285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.9671697 3.0379356 17.435 < 2e-16 ***
## gdpcap85 0.0003216 0.0002551 1.260 0.211
## school 2.3866392 0.4102266 5.818 9.01e-08 ***
## civlib5 -0.7522863 0.4861266 -1.548 0.125
## wartime 1.0936902 5.5354286 0.198 0.844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.43 on 90 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7399
## F-statistic: 67.84 on 4 and 90 DF, p-value: < 2.2e-16
These can be extracted from the object returned by summary
. Explore the object to find the elements corresponding to those values:
str(mod1_summary)
The functions AIC
and BIC
do what you would expect them to do,
AIC(mod1)
## [1] 597.9328
BIC(mod1)
## [1] 613.256
The glance
function from the broom package returns all of these within a model frame:
glance(mod1)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.750932 0.7398623 5.430069 67.83678 2.375664e-26 5 -292.9664
## AIC BIC deviance df.residual
## 1 597.9328 613.256 2653.708 90
Now let’s calculate these values for Model 2:
glance(mod2)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.8888772 0.8839384 3.627003 179.9787 4.719766e-42 5 -254.6296
## AIC BIC deviance df.residual
## 1 521.2592 536.5825 1183.964 90
By any of those measures, Model 2 is the better fitting model.
The boot package contains functions to do cross validation.
Here is an example of cross-validation using the previous models. Although glm
is used instead of lm
, it produces the same results as lm
. This is needed for the cv.glm
function to work:
mod1_glm <- glm(lifeexp ~ gdpcap85 + school + civlib5 + wartime, data = barro)
This runs 10-fold cross validation
cv_err_K10 <- cv.glm(barro, mod1_glm, K=10)
The object returned by cv.glm
includes the original function call, the value of K
, the predication error delta
, and the random number seed. For now, all we care about is the prediction error. By default, cv.glm
returns the mean squared error. There are two elements that it returns: the first is the average of the mean squared errors from the folds; the second corrects for bias occuring from using k-folds rather than leave-one-out cross-validation.
cv_err_K10$delta
## [1] 31.50753 31.31174
5-fold cross-validation
cv_err_K5 <- cv.glm(barro, mod1_glm, K=5)
cv_err_K5$delta
## [1] 30.68531 30.37128
And leave-one-out cross-validation (no K
argument is given):
mod_cv_loo <- cv.glm(barro, mod1_glm)
mod_cv_loo$delta
## [1] 31.46432 31.44491
The interpretation in these cases is that the average prediction RMSE is over 30 years. This is not good. Not good at all.
We can calculate the same for Model 2:
mod2_glm <- glm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime, data = barro)
cv.glm(barro, mod2_glm, K=10)$delta
## [1] 13.49913 13.44212
cv.glm(barro, mod2_glm, K=5)$delta
## [1] 13.89080 13.72502
cv.glm(barro, mod2_glm)$delta
## [1] 13.69680 13.69006
In these cases the prediction RMSE is around 14 years. This is okay. Maybe not as good as we would like, but a lot better than missing by 30 years.
We could also try adding a squared term to war to see whether that improves the model. We’ll use a 10-fold cross-validation for it.
mod3 <- glm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime + I(wartime ^ 2), data = barro)
cv.glm(barro, mod3, K = 10)$delta
## [1] 12.59070 12.52253
It shows a small improvement in the prediction RMSE. Adding squared wartime
seems to improve the model, but it is not nearly as important as getting the functional form of gdpcap85
and school
correct.
The first step is to draw values of \(\beta\) from a multvariate normal distribution centered at \(\hat\beta\) with covariance matrix \(vcov(\hat\beta)\): \[ \tilde\beta \sim N(\hat\beta, V(\hat\beta)) \] This relies on the CLT result that \(\hat\beta\) approx multivariate normal as \(n \to \infty\). We will randomly sample simulations of \(\beta\) from that distribution:
n <- 1024
simbetas <- MASS::mvrnorm(n, coef(mod2), vcov(mod2))
For log(gdpcap85)
compare it at its mean versus 1 standard deviation
barro_low <- summarize(na.omit(barro),
gdpcap85 = exp(mean(log(gdpcap85))),
school = exp(mean(log(school))),
civlib5 = mean(civlib5),
wartime = mean(wartime))
barro_high <- summarize(na.omit(barro),
gdpcap85 = exp(mean(log(gdpcap85)) + sd(log(gdpcap85))),
school = exp(mean(log(school))),
civlib5 = mean(civlib5),
wartime = mean(wartime))
Simulate and calculate a standard error around the difference
xlow <- model.matrix(~ log(gdpcap85) + log(school)
+ civlib5 + wartime, data = barro_low)
xhigh <- model.matrix(~ log(gdpcap85) + log(school)
+ civlib5 + wartime, data = barro_high)
diffs <- rep(NA, nrow(simbetas))
for (i in 1:nrow(simbetas)) {
betas <- simbetas[i, ]
diffs[i] <- xhigh %*% betas - xlow %*% betas
}
Two packages implement this method of simulatin for comparison:
The function Boot
from car can be used to bootstrap coefficients.
Boot(mod2)
##
## CASE RESAMPLING BOOTSTRAP FOR CENSORED DATA
##
##
## Call:
## boot::boot(data = data.frame(update(object, model = TRUE)$model),
## statistic = boot.f, R = R, .fn = f)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 13.1893476 -5.806878e-02 5.4902883
## t2* 5.3730192 4.712968e-03 0.6705320
## t3* 6.0430645 1.270322e-02 0.8177946
## t4* -0.1842516 3.011728e-05 0.2824878
## t5* -0.2173762 5.102515e-01 3.4657527
The boot package has more powerful and general functions.
Bootstrapping can be done manually with dplyr
functions:
simulations <- list()
for (i in 1:1024) {
# Randomly resample data
.data <- sample_frac(barro, replace = TRUE)
# Run a new regression
mod <- lm(lifeexp ~ log(gdpcap85) + log(school) + civlib5 + wartime, data = .data)
# Predict
lower <- predict(mod, newdata = barro_low)
higher <- predict(mod, newdata = barro_high)
simulations[[i]] <-
data.frame(diff = higher - lower)
}
sims <- bind_rows(simulations)
From these simulations, one way to calculate a 95% cofidence interval is just the 2.5–97.5 percentiles:
quantile(sims$diff, c(0.025, 0.975))
## 2.5% 97.5%
## 4.271793 7.087771
This code could have been streamlined using the bootstrap
function in the broom package.
Example adapted from Christopher Adolph (Spring 2014) “Linear Regression: Specification and Fitting” [Lecture slides]. http://faculty.washington.edu/cadolph/503/topic5.pw.pdf.