POLS/CSSS 503, University of Washington, Spring 2015
As always, start the lab by loading the packages we will be using.
library("car")
library("lmtest")
library("MASS")
library("dplyr")
library("ggplot2")
library("broom")
Once more we’ll use data from Ross on oil, this time from the earlier 2001 paper.
rossdata_url <- "http://staff.washington.edu/csjohns/503/rossoildata.csv"
rossdata_raw <- read.csv(rossdata_url, stringsAsFactors = FALSE)
rossdata <- rossdata_raw %>% select(cty_name, year, regime1,
oil, GDPcap, oecd) %>% na.omit()
Let’s run the standard model from Ross for the relationship between oil and GDP.
model1 <- lm(regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
summary(model1)
##
## Call:
## lm(formula = regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8284 -2.5065 -0.2086 2.2014 8.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.736e+02 1.664e+01 -10.433 < 2e-16 ***
## GDPcap 6.177e-05 1.823e-05 3.387 0.000716 ***
## oil -8.262e-02 5.022e-03 -16.450 < 2e-16 ***
## oecd 4.116e+00 1.928e-01 21.346 < 2e-16 ***
## year 9.008e-02 8.413e-03 10.708 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.932 on 2589 degrees of freedom
## Multiple R-squared: 0.4285, Adjusted R-squared: 0.4276
## F-statistic: 485.2 on 4 and 2589 DF, p-value: < 2.2e-16
Now predict the expected value of regime type vs. oil:
pred1_df <- data.frame(oil = seq(0, 100, by = 10), GDPcap = mean(rossdata$GDPcap),
oecd = 0, year = median(rossdata$year))
pred1 <- predict(model1, newdata = pred1_df, interval = "confidence")
ggplot(bind_cols(pred1_df, as.data.frame(pred1)), aes(x = oil,
y = fit, ymin = lwr, ymax = upr)) + geom_ribbon(alpha = 0.2) +
geom_line()
This is easiest to plot with augment
instead
pred1_aug <- augment(model1, newdata = pred1_df)
ggplot(pred1_aug, aes(x = oil, y = .fitted, ymin = .fitted +
2 * .se.fit, ymax = .fitted - 2 * .se.fit)) + geom_ribbon(alpha = 0.2) +
geom_line()
Challenge: Do this with both OECD and non-OECD countries. Before plotting, how will these differ?
Answer:
pred2_df <- expand.grid(oil = seq(0, 100, by = 10), GDPcap = mean(rossdata$GDPcap),
oecd = c(0, 1), year = median(rossdata$year))
pred2 <- predict(model1, newdata = pred2_df, interval = "confidence")
ggplot(bind_cols(pred2_df, as.data.frame(pred2)), aes(x = oil,
y = fit, ymin = lwr, ymax = upr, colour = factor(oecd), fill = factor(oecd))) +
geom_ribbon(alpha = 0.2, colour = NA) + geom_line()
Let’s plot with augment
instead
pred1_aug <- augment(model1, newdata = pred2_df)
ggplot(pred1_aug, aes(x = oil, y = .fitted, ymin = .fitted +
2 * .se.fit, ymax = .fitted - 2 * .se.fit, fill = factor(oecd),
colour = factor(oecd))) + geom_ribbon(alpha = 0.2, colour = NA) +
geom_line()
Challenge: Does it make sense to plot oil
from 0 to 100? Check the range of oil
.
Answer: Surprisingly, not that bad.
ggplot(rossdata, aes(x = oil)) + geom_density() + geom_rug()
Challenge: Interact oil
with oecd
and plot. What will the lines look like?
model2 <- lm(regime1 ~ GDPcap + oil * oecd + year, data = rossdata)
summary(model2)
##
## Call:
## lm(formula = regime1 ~ GDPcap + oil * oecd + year, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7411 -2.5075 -0.1755 2.2446 8.0701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.732e+02 1.663e+01 -10.410 < 2e-16 ***
## GDPcap 5.931e-05 1.826e-05 3.248 0.00118 **
## oil -8.344e-02 5.035e-03 -16.571 < 2e-16 ***
## oecd 4.005e+00 2.003e-01 19.998 < 2e-16 ***
## year 8.985e-02 8.408e-03 10.686 < 2e-16 ***
## oil:oecd 8.863e-02 4.340e-02 2.042 0.04124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.93 on 2588 degrees of freedom
## Multiple R-squared: 0.4294, Adjusted R-squared: 0.4283
## F-statistic: 389.5 on 5 and 2588 DF, p-value: < 2.2e-16
Now consider interact oil
with year
. How would you plot two continuous variables?
model3 <- lm(regime1 ~ GDPcap + oecd + oil * year, data = rossdata)
summary(model3)
##
## Call:
## lm(formula = regime1 ~ GDPcap + oecd + oil * year, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7659 -2.4662 -0.2211 2.2458 7.5283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.900e+02 1.721e+01 -11.039 < 2e-16 ***
## GDPcap 6.228e-05 1.819e-05 3.424 0.000628 ***
## oecd 4.108e+00 1.924e-01 21.351 < 2e-16 ***
## oil 4.400e+00 1.242e+00 3.543 0.000403 ***
## year 9.835e-02 8.700e-03 11.304 < 2e-16 ***
## oil:year -2.263e-03 6.269e-04 -3.609 0.000313 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.925 on 2588 degrees of freedom
## Multiple R-squared: 0.4313, Adjusted R-squared: 0.4302
## F-statistic: 392.6 on 5 and 2588 DF, p-value: < 2.2e-16
In order to interpret the interaction term graphically, we need to pick one variable to treat as continuous — the x-axis — and another for which we will choose specific values. Use oil
as the x-axis, and some values of year
at which to evaluate it. For year
we could use the min, max, and median.
pred3_newdata <- expand.grid(oecd = 0, GDPcap = mean(rossdata$GDPcap),
year = c(1966, 1981, 1997), oil = seq(0, 100, by = 10))
pred3 <- augment(model3, newdata = pred3_newdata)
Now, let’s plot it,
ggplot(pred3, aes(x = oil, y = .fitted, ymin = .fitted - 2 *
.se.fit, ymax = .fitted + 2 * .se.fit, )) + geom_line(aes(colour = factor(year))) +
geom_ribbon(aes(fill = factor(year)), alpha = 1/3) + scale_colour_discrete("year") +
scale_fill_discrete("year") + labs(x = "Percent oil", y = "Expected level of democracy",
title = "Changing relationship of oil and democracy")
Challenge Interact GDPpc
with oil and visualize the interactions.
A common way of comparing the effect of a variable is to compare the results on \(\hat{y}\) for a variable at its mean and at its mean plus one standard deviation, holding all other variables at their means (if continuous), and their
year_seq <- seq(min(rossdata$year), max(rossdata$year), by = 1)
oil_lo <- mean(rossdata$GDPcap)
oil_hi <- mean(rossdata$GDPcap) + sd(rossdata$GDPcap)
pred3_oil_lo <- predict(model3, newdata = data.frame(year = year_seq,
GDPcap = mean(rossdata$GDPcap), oil = oil_lo, oecd = 0),
interval = "confidence")
pred3_oil_hi <- predict(model3, newdata = data.frame(year = year_seq,
GDPcap = mean(rossdata$GDPcap), oil = oil_hi, oecd = 0),
interval = "confidence")
The first difference is calculated. Note that there are no standard error associated with these differences.
pred3_diff <- data.frame(year = year_seq, diff = pred3_oil_hi[,
"fit"] - pred3_oil_lo[, "fit"])
head(pred3_diff)
## year diff
## 1 1966 -259.0802
## 2 1967 -271.1370
## 3 1968 -283.1938
## 4 1969 -295.2507
## 5 1970 -307.3075
## 6 1971 -319.3643
We could plot these differences over time.
ggplot(pred3_diff, aes(x = year, y = diff)) + geom_point() +
geom_hline(yintercept = 0, colour = "red") + labs(x = "Year",
y = "Change in democracy", title = "Change in democracy from high to low oil, over time")
Challenge:
What if we suspect the effect of oil is non-linear?
Specifying a model:
model4 <- lm(regime1 ~ GDPcap + oil + I(oil^2) + oecd, data = rossdata)
summary(model4)
##
## Call:
## lm(formula = regime1 ~ GDPcap + oil + I(oil^2) + oecd, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6194 -2.7171 -0.1169 2.1843 7.8092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.593e+00 8.301e-02 55.329 <2e-16 ***
## GDPcap 1.878e-04 1.425e-05 13.180 <2e-16 ***
## oil -1.084e-01 1.213e-02 -8.935 <2e-16 ***
## I(oil^2) 2.691e-04 1.941e-04 1.387 0.166
## oecd 3.171e+00 1.759e-01 18.034 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.995 on 2589 degrees of freedom
## Multiple R-squared: 0.4036, Adjusted R-squared: 0.4027
## F-statistic: 438 on 4 and 2589 DF, p-value: < 2.2e-16
oil_hyp <- seq(from = 0.01, to = 100, by = 10)
pred4 <- predict(model4, newdata = data.frame(oil = oil_hyp,
GDPcap = mean(rossdata$GDPcap), oecd = 0), interval = "confidence") %>%
as.data.frame()
pred4$oil <- oil_hyp
Plotting the resulting expectations:
ggplot(pred4, aes(x = oil)) + geom_line(aes(y = fit)) + geom_ribbon(aes(ymax = upr,
ymin = lwr), alpha = 1/3) + labs(x = "Percent oil", y = "Expected level of democracy",
title = "Relationship of oil and democracy, with quadratic transformation of oil")
Challenge: Transform GDP per capita by log.
Answer: Specifying a model:
model_logGDP <- lm(regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
summary(model_logGDP)
##
## Call:
## lm(formula = regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.169 -2.125 -0.077 2.010 7.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.46656 0.42426 -12.88 <2e-16 ***
## log(GDPcap) 1.43853 0.05761 24.97 <2e-16 ***
## oil -0.11005 0.00470 -23.42 <2e-16 ***
## oecd 2.22692 0.15951 13.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778 on 2590 degrees of freedom
## Multiple R-squared: 0.4868, Adjusted R-squared: 0.4862
## F-statistic: 818.8 on 3 and 2590 DF, p-value: < 2.2e-16
gdp_hyp <- seq(from = min(rossdata$GDPcap), to = max(rossdata$GDPcap),
length.out = 50)
pred4 <- predict(model_logGDP, newdata = data.frame(oil = median(oil_hyp),
GDPcap = gdp_hyp, oecd = 0), interval = "confidence") %>%
as.data.frame()
pred4$gdp <- gdp_hyp
Why was median used there?
Plotting the resulting expectations:
ggplot(pred4, aes(x = gdp)) + geom_line(aes(y = fit)) + geom_ribbon(aes(ymax = upr,
ymin = lwr), alpha = 1/3) + labs(x = "GDP per capita", y = "Expected level of democracy",
title = "Relationship of log(GDP per capita) and democracy")
Challenge: Transform oil by log. What happens? How would you solve it?
Log of 0 is undefined. This is often handled by adding and arbitrary small value to 0’s. In this case we’ll add 0.001. There are better ways to deal with this, but for now…
model_logoil <- lm(regime1 ~ GDPcap + log(oil_mod) + oecd, data = mutate(rossdata,
oil_mod = log(oil + 0.001)))
## Warning in log(oil_mod): NaNs produced
summary(model_logoil)
##
## Call:
## lm(formula = regime1 ~ GDPcap + log(oil_mod) + oecd, data = mutate(rossdata,
## oil_mod = log(oil + 0.001)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6337 -2.4525 -0.7552 1.2146 6.5659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.850e+00 1.434e-01 26.847 < 2e-16 ***
## GDPcap 2.708e-05 2.419e-05 1.119 0.263
## log(oil_mod) -7.896e-01 1.018e-01 -7.755 2.42e-14 ***
## oecd 5.420e+00 3.310e-01 16.373 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.976 on 887 degrees of freedom
## (1703 observations deleted due to missingness)
## Multiple R-squared: 0.4523, Adjusted R-squared: 0.4504
## F-statistic: 244.1 on 3 and 887 DF, p-value: < 2.2e-16
pred_logoil <- as.data.frame(predict(model_logoil, newdata = data.frame(oil_mod = seq(from = 0.01,
to = 100, by = 1), GDPcap = mean(rossdata$GDPcap), oecd = 0),
interval = "confidence"))
pred_logoil$oil <- seq(from = 0.01, to = 100, by = 1)
ggplot(pred_logoil, aes(x = oil)) + geom_line(aes(y = fit)) +
geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 1/3) + labs(x = "Oil production",
y = "Expected level of democracy", title = "Relationship of log(oil) and democracy")
Recall the default model,
model1
##
## Call:
## lm(formula = regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
##
## Coefficients:
## (Intercept) GDPcap oil oecd year
## -1.736e+02 6.177e-05 -8.262e-02 4.116e+00 9.008e-02
We can use the functions residuals
and predict
to extract residuals and fitted values from an lm
object:
residuals(model1)
fitted(model1)
However, the function augment
easily returns these as .fit
and .resid
.
model1_aug <- augment(model1)
We can plot the fitted values vs. the outcome variable to look for nonlinearity,
ggplot(model1_aug, aes(x = regime1, y = .fitted)) + geom_point() +
geom_smooth() + ylab("E(Democracy | X)") + xlab("Democracy")
or the residuals vs. the fitted values to look for heteroskedasticity
ggplot(model1_aug, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, colour = "red") + ylab("Residual") +
xlab("E(Democracy | X)")
Challenge: What accounts for the unusual patterns? hint: What values can Democracy take?
It is obvious in this case, but plotting the sqrt of the absolute value of the errors can make it more clear
ggplot(model1_aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() + geom_hline(yintercept = 0, colour = "red") +
geom_smooth(se = FALSE) + ylab("Residual") + xlab("E(Democracy | X)")
Challenge: Plot residuals against each independent variable, and the fitted values. Is there evidence of heteroskedasticity or nonlinearity?
Answer:
ggplot(model1_aug, aes(x = regime1, y = .resid)) + geom_point() +
geom_smooth() + ylab("Residual") + xlab("Oil")
ggplot(model1_aug, aes(x = GDPcap, y = .resid)) + geom_point() +
geom_smooth() + ylab("Residual") + xlab("GDPcap")
Challenge: Compare residual plot for model with log(GDPcap)
. How have the residual plots changed?
Answer:
# model with logged GDP per capita
summary(model_logGDP)
##
## Call:
## lm(formula = regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.169 -2.125 -0.077 2.010 7.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.46656 0.42426 -12.88 <2e-16 ***
## log(GDPcap) 1.43853 0.05761 24.97 <2e-16 ***
## oil -0.11005 0.00470 -23.42 <2e-16 ***
## oecd 2.22692 0.15951 13.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778 on 2590 degrees of freedom
## Multiple R-squared: 0.4868, Adjusted R-squared: 0.4862
## F-statistic: 818.8 on 3 and 2590 DF, p-value: < 2.2e-16
logGDP_aug <- augment(model_logGDP)
# look at how the augment dataframe updates to the
# transformed variables:
head(logGDP_aug)
## .rownames regime1 log.GDPcap. oil oecd .fitted .se.fit
## 1 48 1.5 7.440232 35.60206200 0 1.318336 0.15262838
## 2 49 5.0 7.440232 26.51153200 0 2.318776 0.11508216
## 3 86 5.0 7.722483 0.23792800 0 5.616292 0.07275067
## 4 87 8.0 7.638494 0.06946399 0 5.514012 0.07152078
## 5 92 1.5 9.709267 65.93863000 0 1.243784 0.27989124
## 6 93 1.5 9.687519 64.25789000 0 1.397469 0.27287186
## .resid .hat .sigma .cooksd .std.resid
## 1 0.1816644 0.0030183605 2.778646 3.246211e-06 0.06549021
## 2 2.6812241 0.0017159972 2.778148 4.009730e-04 0.96595374
## 3 -0.6162924 0.0006857639 2.778622 8.448597e-06 -0.22191468
## 4 2.4859883 0.0006627734 2.778218 1.328555e-04 0.89514485
## 5 0.2562156 0.0101503065 2.778644 2.202888e-05 0.09269819
## 6 0.1025306 0.0096475725 2.778648 3.349548e-06 0.03708591
ggplot(logGDP_aug, aes(x = log.GDPcap., y = .resid)) + geom_point() +
geom_smooth() + ylab("Residual") + xlab("log(GDPcap)")
What about the normality of errors?
ggplot(model1_aug, aes(sample = .std.resid)) + stat_qq() + geom_abline(slope = 1)
For more advanced residual plots see car functions:
avPlots
: added variable plotceresPlots
, crPlots
: Component + residual (partial residual) plotsresidualPlots
Also try using plot
on an lm
object:
plot(model1)
Alternatively, base graphics can be appropriate for quickly generating plots of residuals:
par(mfrow = c(2, 2))
plot(model1$fitted, model1$residuals) #what is this plot showing?
plot(rossdata$GDPcap, model1$residuals) #what is this plot showing?
plot(rossdata$oil, model1$residuals)
plot(rossdata$oecd, model1$residuals)
par(mfrow = c(1, 1))
We can calculate heteroskedasticity consistent (robust, White) standard errors using the car function hccm
:
The robust standard errors for the coefficients of model1
are:
sqrt(diag(hccm(model1)))
## (Intercept) GDPcap oil oecd year
## 1.744310e+01 1.710027e-05 4.213601e-03 1.848413e-01 8.813878e-03
Are they different than the classical standard errors of those in model1
?
To use the robust standard errors in a significance test supply the robust variance-covariance matrix to the coeftest
function from the lmtest package:
coeftest(model1, hccm(model1))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7363e+02 1.7443e+01 -9.9542 < 2.2e-16 ***
## GDPcap 6.1766e-05 1.7100e-05 3.6120 0.0003096 ***
## oil -8.2617e-02 4.2136e-03 -19.6073 < 2.2e-16 ***
## oecd 4.1160e+00 1.8484e-01 22.2676 < 2.2e-16 ***
## year 9.0084e-02 8.8139e-03 10.2207 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(The answer to confidence intervals on first differences) We’re going to load the dataset on occupational prestige used by Fox in Chapter 5. Note: you can view the codebook here: http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Prestige.pdf
jobs <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Prestige.txt")
summary(jobs)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
foxmodel <- lm(prestige ~ education + income + women, data = jobs)
summary(foxmodel)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = jobs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8246 -5.3332 -0.1364 5.1587 17.5045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7943342 3.2390886 -2.098 0.0385 *
## education 4.1866373 0.3887013 10.771 < 2e-16 ***
## income 0.0013136 0.0002778 4.729 7.58e-06 ***
## women -0.0089052 0.0304071 -0.293 0.7702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.846 on 98 degrees of freedom
## Multiple R-squared: 0.7982, Adjusted R-squared: 0.792
## F-statistic: 129.2 on 3 and 98 DF, p-value: < 2.2e-16
Now let’s extract the coefficients vector and the variance-covariance matrix:
pe <- coef(foxmodel)
vc <- vcov(foxmodel)
pe #look at them
## (Intercept) education income women
## -6.794334203 4.186637275 0.001313560 -0.008905157
vc
## (Intercept) education income women
## (Intercept) 10.4916952589 -0.9787550408 1.151261e-04 -5.548945e-03
## education -0.9787550408 0.1510886706 -7.290930e-05 -5.107514e-03
## income 0.0001151261 -0.0000729093 7.716239e-08 4.942693e-06
## women -0.0055489446 -0.0051075141 4.942693e-06 9.245892e-04
Suppose we want to calculate the predicted value of prestige for a job that requires 10 years of education, pays #$10,000 (in 1971 dollars), and is 50% female.
The long-winded way of doing it is:
predictedvalue <- pe[1] + 10 * pe[2] + 10000 * pe[3] + 50 * pe[4]
predictedvalue
## (Intercept)
## 47.76238
Remember: This is just a way of manually doing what predict()
does. A faster way of doing it uses matrix multiplication:
predictedvalue <- c(1, 10, 10000, 50) %*% pe
predictedvalue
## [,1]
## [1,] 47.76238
Let’s now try to model the uncertainty around this estimate. We can do this by taking random draws from the #distribution of possible beta values that our model has estimated.
library(MASS)
simbetas <- mvrnorm(1000, pe, vc) # this gives us 1000 estimates of the coefficient estimates
head(simbetas) # look at it
## (Intercept) education income women
## [1,] -7.733039 4.139561 0.0014942972 0.0002804371
## [2,] -3.126209 3.990875 0.0010343540 -0.0282165533
## [3,] -5.031826 4.199026 0.0012092777 -0.0339291174
## [4,] -9.528724 4.321738 0.0015280557 -0.0170317620
## [5,] -1.835650 4.027316 0.0009141161 -0.0202532893
## [6,] -10.981953 4.853358 0.0011204599 -0.0581665325
Examine the uncertainty around each coefficient (this is using base graphics for quick and dirty data exploration):
plot(density(simbetas[, 2]), col = "blue", lwd = 2, xlim = c(-1,
6), ylim = c(0, 20), main = "", xlab = "Coef Estimate") #initiating a plot with the first density
lines(density(simbetas[, 3]), col = "red", lwd = 2) # adding next coef
lines(density(simbetas[, 4]), col = "green", lwd = 2) # adding last coef
What’s going on with the red one? Examine coefficient for income more closely (it’s still normal):
plot(density(simbetas[, 3]), col = "red")
Now let’s calculate the predicted values of prestige for each of the 1000 possible coefficient estimates. We can do this most efficiently using matrix algebra:
xhyp <- c(1, 10, 10000, 50) # define our scenario: vector of hypothetical x values
predictions <- simbetas %*% xhyp
head(predictions) # look at it
## [,1]
## [1,] 48.61957
## [2,] 45.71525
## [3,] 47.35475
## [4,] 48.11763
## [5,] 46.56600
## [6,] 45.84790
Now let’s look at the distribution of predicted values:
hist(predictions)
plot(density(predictions), main = "Predicted Prestige", lwd = 2,
col = "blue", xlab = "Prestige", ylab = "Relative Frequency")
We can calculate the mean of this distribution, and the lower and upper 95% confidence intervals like this:
EV <- mean(predictions)
lowerCI <- quantile(predictions, probs = 0.025)
upperCI <- quantile(predictions, probs = 0.975)
Compare the EV
to the predictedvalue
above:
EV
## [1] 47.87825
predictedvalue # you'll see that it's roughly the same, but not exactly. Why not?
## [,1]
## [1,] 47.76238
You can also compare these estimates to those obtained using the function predict().
predict.lm(foxmodel, newdata = list(education = 10, income = 10000,
women = 50), interval = "confidence")
## fit lwr upr
## 1 47.76238 44.29432 51.23045
EV
## [1] 47.87825
lowerCI
## 2.5%
## 44.69479
upperCI
## 97.5%
## 51.24598
Let’s add lines to the graph indicating the mean and 95% CIs:
plot(density(predictions), main = "Predicted Prestige", lwd = 2,
col = "blue", xlab = "Prestige", ylab = "Relative Frequency")
abline(v = EV, col = "red")
abline(v = c(lowerCI, upperCI), lty = 2)
Now let’s estimate the distribution of prestige values for a job that is the same in every respect but pays $12,000 #instead of $10,000:
xhyp2 <- c(1, 10, 12000, 50) # define our scenario vector
predictions2 <- simbetas %*% xhyp2 # multiply by the simulated coefficients
plot(density(predictions), main = "Predicted Prestige", lwd = 3,
col = "blue", bty = "n", xlab = "Prestige", ylab = "Relative Frequency",
xlim = c(40, 60))
lines(density(predictions2), lwd = 3, col = "red")
legend(53, 0.2, c("$10k", "$12k"), lwd = 3, col = c("blue", "red"),
cex = 0.8, bty = "n")
Question: Why are we using density plots instead of a histogram? Question: Why does the $12k distribution have greater variation than the $10k one?
Hint: look at the mean of the income variable.
Suppose we want to examine the effect of changes in income on prestige over a continuous range of values - i.e., not just the $10k and $12k scenarios we looked at above.
Create the range of X for which you want to explore the expectations for y, in a dataframe with placeholders for the results (Note we’re using “length.out” instead of “by” in seq
):
scenarios <- data_frame(income = seq(from = min(jobs$income),
to = max(jobs$income), length.out = 10), EVs = NA, lowerCIs = NA,
upperCIs = NA)
Then get the coefficients and vcmatrix from the model:
pe <- coef(foxmodel)
vc <- vcov(foxmodel)
Take 1000 draws from the multivariate normal distribution:
simbetas <- mvrnorm(1000, pe, vc)
Set up a loop that goes through all 10 values in the income.range vector:
for (i in 1:nrow(scenarios)) {
# define the covariate values for the current scenario
xhyp <- c(1, 10, scenarios$income[i], 50) #make sure you understand what these numbers represent
# calculate the 1000 predictions:
yhyp <- simbetas %*% xhyp
# Now slot the mean, lowerCI and upperCI values into the
# appropriate positions in the results vectors:
scenarios$EVs[i] <- mean(yhyp)
scenarios$lowerCIs[i] <- sort(yhyp)[25]
scenarios$upperCIs[i] <- sort(yhyp)[975]
} # close i loop
So, now that we’ve got the results, we can look at them :
scenarios
## Source: local data frame [10 x 4]
##
## income EVs lowerCIs upperCIs
## (dbl) (dbl) (dbl) (dbl)
## 1 611.000 35.42838 32.41529 38.17948
## 2 3418.556 39.09854 37.06306 41.00832
## 3 6226.111 42.76870 40.75085 44.73954
## 4 9033.667 46.43885 43.39157 49.40357
## 5 11841.222 50.10901 45.66956 54.45994
## 6 14648.778 53.77917 48.01691 59.42830
## 7 17456.333 57.44933 50.19847 64.62338
## 8 20263.889 61.11949 52.43290 69.64787
## 9 23071.444 64.78965 54.60657 74.84752
## 10 25879.000 68.45981 56.74642 80.05005
But of course it’s much better to plot them:
ggplot(scenarios, aes(x = income)) + geom_line(aes(y = EVs)) +
geom_ribbon(aes(ymin = lowerCIs, ymax = upperCIs), alpha = 1/3) +
labs(x = "Income", y = "Expected Prestige", title = "Relationship of Income and Prestige, 50% Women & 10 Years of Education")
foxmodel2 <- lm(prestige ~ education + income * women, data = jobs)
summary(foxmodel2)
##
## Call:
## lm(formula = prestige ~ education + income * women, data = jobs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4623 -5.1970 -0.4228 5.2365 16.8228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.760e+00 3.698e+00 -1.017 0.312
## education 3.883e+00 4.268e-01 9.098 1.19e-14 ***
## income 1.236e-03 2.793e-04 4.426 2.51e-05 ***
## women -8.692e-02 5.599e-02 -1.552 0.124
## income:women 2.179e-05 1.318e-05 1.653 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.778 on 97 degrees of freedom
## Multiple R-squared: 0.8037, Adjusted R-squared: 0.7956
## F-statistic: 99.29 on 4 and 97 DF, p-value: < 2.2e-16
What do these results mean? Let’s use first differences
Get the coefficients and vcmatrix from the model:
pe2 <- coef(foxmodel2)
vc2 <- vcov(foxmodel2)
Take 1000 draws from the multivariate normal distribution:
simbetas2 <- mvrnorm(1000, pe2, vc2)
Set up range of women employed in that occupation, and define the high and low values of income:
women_range <- seq(from = 0, to = 100, by = 10)
income_hi <- mean(jobs$income) + sd(jobs$income)
income_lo <- mean(jobs$income) - sd(jobs$income)
Create placeholders for the results:
scenarios_women <- data_frame(women = women_range, fd_mean = NA,
fd_lowerCI = NA, fd_upperCI = NA)
Set up a loop that goes through all 11 values in the women_range vector:
for (i in 1:nrow(scenarios_women)) {
# define the covariate values for the current scenario
x.income_hi <- c(1, 10, income_hi, women_range[i], income_hi *
women_range[i])
# make sure you understand where these numbers are coming
# from
x.income_lo <- c(1, 10, income_lo, women_range[i], income_lo *
women_range[i])
firstdifferences <- (simbetas2 %*% x.income_hi) - (simbetas2 %*%
x.income_lo)
scenarios_women$fd_mean[i] <- mean(firstdifferences)
scenarios_women$fd_lowerCI[i] <- quantile(firstdifferences,
probs = 0.025)
scenarios_women$fd_upperCI[i] <- quantile(firstdifferences,
probs = 0.975) #why these numbers instead of .5 and .95?
} # close i loop
Examine the results in a table:
scenarios_women
## Source: local data frame [11 x 4]
##
## women fd_mean fd_lowerCI fd_upperCI
## (dbl) (dbl) (dbl) (dbl)
## 1 0 10.56958 5.743428 15.26759
## 2 10 12.39355 7.797728 17.21062
## 3 20 14.21751 8.613678 19.95399
## 4 30 16.04147 8.989788 23.35442
## 5 40 17.86543 9.016134 27.09106
## 6 50 19.68939 8.726005 30.69773
## 7 60 21.51335 8.609789 34.31999
## 8 70 23.33731 8.135395 38.35044
## 9 80 25.16127 7.935264 42.36954
## 10 90 26.98523 7.603087 46.18401
## 11 100 28.80919 7.402275 50.09452
But of course it’s much better to plot them:
ggplot(scenarios_women, aes(x = women)) + geom_line(aes(y = fd_mean)) +
geom_ribbon(aes(ymin = fd_lowerCI, ymax = fd_upperCI), alpha = 1/3) +
labs(x = "Percentage of Individuals in an Occupation who are Women",
y = "Expected Increase in Prestige", title = "Effect of a change in income from 1 s.d. below to 1 s.d. above mean")
Note: Chris has a package called simcf that does this and integrates well with his package tile for plotting - more detail on this later
Challenge
Return to the first model we ran on the occupational prestige data.
foxmodel <- lm(prestige ~ education + income + women, data = jobs)
summary(foxmodel)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = jobs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8246 -5.3332 -0.1364 5.1587 17.5045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7943342 3.2390886 -2.098 0.0385 *
## education 4.1866373 0.3887013 10.771 < 2e-16 ***
## income 0.0013136 0.0002778 4.729 7.58e-06 ***
## women -0.0089052 0.0304071 -0.293 0.7702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.846 on 98 degrees of freedom
## Multiple R-squared: 0.7982, Adjusted R-squared: 0.792
## F-statistic: 129.2 on 3 and 98 DF, p-value: < 2.2e-16
For each independent variable in the model, calculate the expected change in prestige that results from a change in that independent variable from one SD below its mean to one SD above its mean, along with 95 percent confidence intervals around this quantity.