POLS/CSSS 503, University of Washington, Spring 2015

More on Interpretation
First Differences
Transformations
Post-estimation Diagnostics
- Robust standard errors
Using simulation to compute expected values and confidence intervals:
- Plotting Expected Values
- Illustrating interaction terms with a first differences plot – including confidence intervals!
Sources

As always, start the lab by loading the packages we will be using.

library("car")
library("lmtest")
library("MASS")
library("dplyr")
library("ggplot2")
library("broom")

More on Interpretation

Once more we’ll use data from Ross on oil, this time from the earlier 2001 paper.

rossdata_url <- "http://staff.washington.edu/csjohns/503/rossoildata.csv"
rossdata_raw <- read.csv(rossdata_url, stringsAsFactors = FALSE)
rossdata <- rossdata_raw %>% select(cty_name, year, regime1, 
  oil, GDPcap, oecd) %>% na.omit()

Let’s run the standard model from Ross for the relationship between oil and GDP.

model1 <- lm(regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
summary(model1)

## 
## Call:
## lm(formula = regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8284 -2.5065 -0.2086  2.2014  8.0464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.736e+02  1.664e+01 -10.433  < 2e-16 ***
## GDPcap       6.177e-05  1.823e-05   3.387 0.000716 ***
## oil         -8.262e-02  5.022e-03 -16.450  < 2e-16 ***
## oecd         4.116e+00  1.928e-01  21.346  < 2e-16 ***
## year         9.008e-02  8.413e-03  10.708  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.932 on 2589 degrees of freedom
## Multiple R-squared:  0.4285, Adjusted R-squared:  0.4276 
## F-statistic: 485.2 on 4 and 2589 DF,  p-value: < 2.2e-16

Now predict the expected value of regime type vs. oil:

pred1_df <- data.frame(oil = seq(0, 100, by = 10), GDPcap = mean(rossdata$GDPcap), 
  oecd = 0, year = median(rossdata$year))
pred1 <- predict(model1, newdata = pred1_df, interval = "confidence")

ggplot(bind_cols(pred1_df, as.data.frame(pred1)), aes(x = oil, 
  y = fit, ymin = lwr, ymax = upr)) + geom_ribbon(alpha = 0.2) + 
  geom_line()

This is easiest to plot with augment instead

pred1_aug <- augment(model1, newdata = pred1_df)
ggplot(pred1_aug, aes(x = oil, y = .fitted, ymin = .fitted + 
  2 * .se.fit, ymax = .fitted - 2 * .se.fit)) + geom_ribbon(alpha = 0.2) + 
  geom_line()

Challenge: Do this with both OECD and non-OECD countries. Before plotting, how will these differ?

Answer:

pred2_df <- expand.grid(oil = seq(0, 100, by = 10), GDPcap = mean(rossdata$GDPcap), 
  oecd = c(0, 1), year = median(rossdata$year))
pred2 <- predict(model1, newdata = pred2_df, interval = "confidence")
ggplot(bind_cols(pred2_df, as.data.frame(pred2)), aes(x = oil, 
  y = fit, ymin = lwr, ymax = upr, colour = factor(oecd), fill = factor(oecd))) + 
  geom_ribbon(alpha = 0.2, colour = NA) + geom_line()

Let’s plot with augment instead

pred1_aug <- augment(model1, newdata = pred2_df)
ggplot(pred1_aug, aes(x = oil, y = .fitted, ymin = .fitted + 
  2 * .se.fit, ymax = .fitted - 2 * .se.fit, fill = factor(oecd), 
  colour = factor(oecd))) + geom_ribbon(alpha = 0.2, colour = NA) + 
  geom_line()

Challenge: Does it make sense to plot oil from 0 to 100? Check the range of oil.

Answer: Surprisingly, not that bad.

ggplot(rossdata, aes(x = oil)) + geom_density() + geom_rug()

Challenge: Interact oil with oecd and plot. What will the lines look like?

model2 <- lm(regime1 ~ GDPcap + oil * oecd + year, data = rossdata)
summary(model2)

## 
## Call:
## lm(formula = regime1 ~ GDPcap + oil * oecd + year, data = rossdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7411 -2.5075 -0.1755  2.2446  8.0701 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.732e+02  1.663e+01 -10.410  < 2e-16 ***
## GDPcap       5.931e-05  1.826e-05   3.248  0.00118 ** 
## oil         -8.344e-02  5.035e-03 -16.571  < 2e-16 ***
## oecd         4.005e+00  2.003e-01  19.998  < 2e-16 ***
## year         8.985e-02  8.408e-03  10.686  < 2e-16 ***
## oil:oecd     8.863e-02  4.340e-02   2.042  0.04124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.93 on 2588 degrees of freedom
## Multiple R-squared:  0.4294, Adjusted R-squared:  0.4283 
## F-statistic: 389.5 on 5 and 2588 DF,  p-value: < 2.2e-16

Now consider interact oil with year. How would you plot two continuous variables?

model3 <- lm(regime1 ~ GDPcap + oecd + oil * year, data = rossdata)
summary(model3)

## 
## Call:
## lm(formula = regime1 ~ GDPcap + oecd + oil * year, data = rossdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7659 -2.4662 -0.2211  2.2458  7.5283 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.900e+02  1.721e+01 -11.039  < 2e-16 ***
## GDPcap       6.228e-05  1.819e-05   3.424 0.000628 ***
## oecd         4.108e+00  1.924e-01  21.351  < 2e-16 ***
## oil          4.400e+00  1.242e+00   3.543 0.000403 ***
## year         9.835e-02  8.700e-03  11.304  < 2e-16 ***
## oil:year    -2.263e-03  6.269e-04  -3.609 0.000313 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.925 on 2588 degrees of freedom
## Multiple R-squared:  0.4313, Adjusted R-squared:  0.4302 
## F-statistic: 392.6 on 5 and 2588 DF,  p-value: < 2.2e-16

In order to interpret the interaction term graphically, we need to pick one variable to treat as continuous — the x-axis — and another for which we will choose specific values. Use oil as the x-axis, and some values of year at which to evaluate it. For year we could use the min, max, and median.

pred3_newdata <- expand.grid(oecd = 0, GDPcap = mean(rossdata$GDPcap), 
  year = c(1966, 1981, 1997), oil = seq(0, 100, by = 10))
pred3 <- augment(model3, newdata = pred3_newdata)

Now, let’s plot it,

ggplot(pred3, aes(x = oil, y = .fitted, ymin = .fitted - 2 * 
  .se.fit, ymax = .fitted + 2 * .se.fit, )) + geom_line(aes(colour = factor(year))) + 
  geom_ribbon(aes(fill = factor(year)), alpha = 1/3) + scale_colour_discrete("year") + 
  scale_fill_discrete("year") + labs(x = "Percent oil", y = "Expected level of democracy", 
  title = "Changing relationship of oil and democracy")

Challenge Interact GDPpc with oil and visualize the interactions.

First Differences

A common way of comparing the effect of a variable is to compare the results on $\hat{y}$ for a variable at its mean and at its mean plus one standard deviation, holding all other variables at their means (if continuous), and their

year_seq <- seq(min(rossdata$year), max(rossdata$year), by = 1)
oil_lo <- mean(rossdata$GDPcap)
oil_hi <- mean(rossdata$GDPcap) + sd(rossdata$GDPcap)
pred3_oil_lo <- predict(model3, newdata = data.frame(year = year_seq, 
  GDPcap = mean(rossdata$GDPcap), oil = oil_lo, oecd = 0), 
  interval = "confidence")

pred3_oil_hi <- predict(model3, newdata = data.frame(year = year_seq, 
  GDPcap = mean(rossdata$GDPcap), oil = oil_hi, oecd = 0), 
  interval = "confidence")

The first difference is calculated. Note that there are no standard error associated with these differences.

pred3_diff <- data.frame(year = year_seq, diff = pred3_oil_hi[, 
  "fit"] - pred3_oil_lo[, "fit"])
head(pred3_diff)

##   year      diff
## 1 1966 -259.0802
## 2 1967 -271.1370
## 3 1968 -283.1938
## 4 1969 -295.2507
## 5 1970 -307.3075
## 6 1971 -319.3643

We could plot these differences over time.

ggplot(pred3_diff, aes(x = year, y = diff)) + geom_point() + 
  geom_hline(yintercept = 0, colour = "red") + labs(x = "Year", 
  y = "Change in democracy", title = "Change in democracy from high to low oil, over time")

Challenge:

What does this plot tell us about the relationship between oil and democracy over time?
What does this provide that the previous plot does not?
How would you you calculate confidence intervals around these estimates?

Transformations

What if we suspect the effect of oil is non-linear?

Specifying a model:

model4 <- lm(regime1 ~ GDPcap + oil + I(oil^2) + oecd, data = rossdata)
summary(model4)

## 
## Call:
## lm(formula = regime1 ~ GDPcap + oil + I(oil^2) + oecd, data = rossdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6194 -2.7171 -0.1169  2.1843  7.8092 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.593e+00  8.301e-02  55.329   <2e-16 ***
## GDPcap       1.878e-04  1.425e-05  13.180   <2e-16 ***
## oil         -1.084e-01  1.213e-02  -8.935   <2e-16 ***
## I(oil^2)     2.691e-04  1.941e-04   1.387    0.166    
## oecd         3.171e+00  1.759e-01  18.034   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.995 on 2589 degrees of freedom
## Multiple R-squared:  0.4036, Adjusted R-squared:  0.4027 
## F-statistic:   438 on 4 and 2589 DF,  p-value: < 2.2e-16

oil_hyp <- seq(from = 0.01, to = 100, by = 10)
pred4 <- predict(model4, newdata = data.frame(oil = oil_hyp, 
  GDPcap = mean(rossdata$GDPcap), oecd = 0), interval = "confidence") %>% 
  as.data.frame()
pred4$oil <- oil_hyp

Plotting the resulting expectations:

ggplot(pred4, aes(x = oil)) + geom_line(aes(y = fit)) + geom_ribbon(aes(ymax = upr, 
  ymin = lwr), alpha = 1/3) + labs(x = "Percent oil", y = "Expected level of democracy", 
  title = "Relationship of oil and democracy, with quadratic transformation of oil")

Challenge: Transform GDP per capita by log.

Answer: Specifying a model:

model_logGDP <- lm(regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
summary(model_logGDP)

## 
## Call:
## lm(formula = regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.169 -2.125 -0.077  2.010  7.539 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.46656    0.42426  -12.88   <2e-16 ***
## log(GDPcap)  1.43853    0.05761   24.97   <2e-16 ***
## oil         -0.11005    0.00470  -23.42   <2e-16 ***
## oecd         2.22692    0.15951   13.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 2590 degrees of freedom
## Multiple R-squared:  0.4868, Adjusted R-squared:  0.4862 
## F-statistic: 818.8 on 3 and 2590 DF,  p-value: < 2.2e-16

gdp_hyp <- seq(from = min(rossdata$GDPcap), to = max(rossdata$GDPcap), 
  length.out = 50)
pred4 <- predict(model_logGDP, newdata = data.frame(oil = median(oil_hyp), 
  GDPcap = gdp_hyp, oecd = 0), interval = "confidence") %>% 
  as.data.frame()
pred4$gdp <- gdp_hyp

Why was median used there?

Plotting the resulting expectations:

ggplot(pred4, aes(x = gdp)) + geom_line(aes(y = fit)) + geom_ribbon(aes(ymax = upr, 
  ymin = lwr), alpha = 1/3) + labs(x = "GDP per capita", y = "Expected level of democracy", 
  title = "Relationship of log(GDP per capita) and democracy")

Challenge: Transform oil by log. What happens? How would you solve it?

Log of 0 is undefined. This is often handled by adding and arbitrary small value to 0’s. In this case we’ll add 0.001. There are better ways to deal with this, but for now…

model_logoil <- lm(regime1 ~ GDPcap + log(oil_mod) + oecd, data = mutate(rossdata, 
  oil_mod = log(oil + 0.001)))

## Warning in log(oil_mod): NaNs produced

summary(model_logoil)

## 
## Call:
## lm(formula = regime1 ~ GDPcap + log(oil_mod) + oecd, data = mutate(rossdata, 
##     oil_mod = log(oil + 0.001)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6337  -2.4525  -0.7552   1.2146   6.5659 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.850e+00  1.434e-01  26.847  < 2e-16 ***
## GDPcap        2.708e-05  2.419e-05   1.119    0.263    
## log(oil_mod) -7.896e-01  1.018e-01  -7.755 2.42e-14 ***
## oecd          5.420e+00  3.310e-01  16.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.976 on 887 degrees of freedom
##   (1703 observations deleted due to missingness)
## Multiple R-squared:  0.4523, Adjusted R-squared:  0.4504 
## F-statistic: 244.1 on 3 and 887 DF,  p-value: < 2.2e-16

pred_logoil <- as.data.frame(predict(model_logoil, newdata = data.frame(oil_mod = seq(from = 0.01, 
  to = 100, by = 1), GDPcap = mean(rossdata$GDPcap), oecd = 0), 
  interval = "confidence"))
pred_logoil$oil <- seq(from = 0.01, to = 100, by = 1)

ggplot(pred_logoil, aes(x = oil)) + geom_line(aes(y = fit)) + 
  geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 1/3) + labs(x = "Oil production", 
  y = "Expected level of democracy", title = "Relationship of log(oil) and democracy")

Post-estimation Diagnostics

Recall the default model,

model1

## 
## Call:
## lm(formula = regime1 ~ GDPcap + oil + oecd + year, data = rossdata)
## 
## Coefficients:
## (Intercept)       GDPcap          oil         oecd         year  
##  -1.736e+02    6.177e-05   -8.262e-02    4.116e+00    9.008e-02

We can use the functions residuals and predict to extract residuals and fitted values from an lm object:

residuals(model1)
fitted(model1)

However, the function augment easily returns these as .fit and .resid.

model1_aug <- augment(model1)

We can plot the fitted values vs. the outcome variable to look for nonlinearity,

ggplot(model1_aug, aes(x = regime1, y = .fitted)) + geom_point() + 
  geom_smooth() + ylab("E(Democracy | X)") + xlab("Democracy")

or the residuals vs. the fitted values to look for heteroskedasticity

ggplot(model1_aug, aes(x = .fitted, y = .resid)) + geom_point() + 
  geom_hline(yintercept = 0, colour = "red") + ylab("Residual") + 
  xlab("E(Democracy | X)")

Challenge: What accounts for the unusual patterns? hint: What values can Democracy take?

It is obvious in this case, but plotting the sqrt of the absolute value of the errors can make it more clear

ggplot(model1_aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) + 
  geom_point() + geom_hline(yintercept = 0, colour = "red") + 
  geom_smooth(se = FALSE) + ylab("Residual") + xlab("E(Democracy | X)")

Challenge: Plot residuals against each independent variable, and the fitted values. Is there evidence of heteroskedasticity or nonlinearity?

Answer:

ggplot(model1_aug, aes(x = regime1, y = .resid)) + geom_point() + 
  geom_smooth() + ylab("Residual") + xlab("Oil")

ggplot(model1_aug, aes(x = GDPcap, y = .resid)) + geom_point() + 
  geom_smooth() + ylab("Residual") + xlab("GDPcap")

Challenge: Compare residual plot for model with log(GDPcap). How have the residual plots changed?

Answer:

# model with logged GDP per capita
summary(model_logGDP)

## 
## Call:
## lm(formula = regime1 ~ log(GDPcap) + oil + oecd, data = rossdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.169 -2.125 -0.077  2.010  7.539 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.46656    0.42426  -12.88   <2e-16 ***
## log(GDPcap)  1.43853    0.05761   24.97   <2e-16 ***
## oil         -0.11005    0.00470  -23.42   <2e-16 ***
## oecd         2.22692    0.15951   13.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 2590 degrees of freedom
## Multiple R-squared:  0.4868, Adjusted R-squared:  0.4862 
## F-statistic: 818.8 on 3 and 2590 DF,  p-value: < 2.2e-16

logGDP_aug <- augment(model_logGDP)

# look at how the augment dataframe updates to the
# transformed variables:
head(logGDP_aug)

##   .rownames regime1 log.GDPcap.         oil oecd  .fitted    .se.fit
## 1        48     1.5    7.440232 35.60206200    0 1.318336 0.15262838
## 2        49     5.0    7.440232 26.51153200    0 2.318776 0.11508216
## 3        86     5.0    7.722483  0.23792800    0 5.616292 0.07275067
## 4        87     8.0    7.638494  0.06946399    0 5.514012 0.07152078
## 5        92     1.5    9.709267 65.93863000    0 1.243784 0.27989124
## 6        93     1.5    9.687519 64.25789000    0 1.397469 0.27287186
##       .resid         .hat   .sigma      .cooksd  .std.resid
## 1  0.1816644 0.0030183605 2.778646 3.246211e-06  0.06549021
## 2  2.6812241 0.0017159972 2.778148 4.009730e-04  0.96595374
## 3 -0.6162924 0.0006857639 2.778622 8.448597e-06 -0.22191468
## 4  2.4859883 0.0006627734 2.778218 1.328555e-04  0.89514485
## 5  0.2562156 0.0101503065 2.778644 2.202888e-05  0.09269819
## 6  0.1025306 0.0096475725 2.778648 3.349548e-06  0.03708591

ggplot(logGDP_aug, aes(x = log.GDPcap., y = .resid)) + geom_point() + 
  geom_smooth() + ylab("Residual") + xlab("log(GDPcap)")

What about the normality of errors?

ggplot(model1_aug, aes(sample = .std.resid)) + stat_qq() + geom_abline(slope = 1)

For more advanced residual plots see car functions:

avPlots: added variable plot
ceresPlots, crPlots: Component + residual (partial residual) plots
residualPlots

Also try using plot on an lm object:

plot(model1)

Alternatively, base graphics can be appropriate for quickly generating plots of residuals:

par(mfrow = c(2, 2))
plot(model1$fitted, model1$residuals)  #what is this plot showing?
plot(rossdata$GDPcap, model1$residuals)  #what is this plot showing?
plot(rossdata$oil, model1$residuals)
plot(rossdata$oecd, model1$residuals)

par(mfrow = c(1, 1))

Robust standard errors

We can calculate heteroskedasticity consistent (robust, White) standard errors using the car function hccm:

The robust standard errors for the coefficients of model1 are:

sqrt(diag(hccm(model1)))

##  (Intercept)       GDPcap          oil         oecd         year 
## 1.744310e+01 1.710027e-05 4.213601e-03 1.848413e-01 8.813878e-03

Are they different than the classical standard errors of those in model1?

To use the robust standard errors in a significance test supply the robust variance-covariance matrix to the coeftest function from the lmtest package:

coeftest(model1, hccm(model1))

## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) -1.7363e+02  1.7443e+01  -9.9542 < 2.2e-16 ***
## GDPcap       6.1766e-05  1.7100e-05   3.6120 0.0003096 ***
## oil         -8.2617e-02  4.2136e-03 -19.6073 < 2.2e-16 ***
## oecd         4.1160e+00  1.8484e-01  22.2676 < 2.2e-16 ***
## year         9.0084e-02  8.8139e-03  10.2207 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using simulation to compute expected values and confidence intervals:

(The answer to confidence intervals on first differences) We’re going to load the dataset on occupational prestige used by Fox in Chapter 5. Note: you can view the codebook here: http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Prestige.pdf

jobs <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Prestige.txt")
summary(jobs)

##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517

foxmodel <- lm(prestige ~ education + income + women, data = jobs)
summary(foxmodel)

## 
## Call:
## lm(formula = prestige ~ education + income + women, data = jobs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8246  -5.3332  -0.1364   5.1587  17.5045 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.7943342  3.2390886  -2.098   0.0385 *  
## education    4.1866373  0.3887013  10.771  < 2e-16 ***
## income       0.0013136  0.0002778   4.729 7.58e-06 ***
## women       -0.0089052  0.0304071  -0.293   0.7702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.846 on 98 degrees of freedom
## Multiple R-squared:  0.7982, Adjusted R-squared:  0.792 
## F-statistic: 129.2 on 3 and 98 DF,  p-value: < 2.2e-16

Now let’s extract the coefficients vector and the variance-covariance matrix:

pe <- coef(foxmodel)
vc <- vcov(foxmodel)

pe  #look at them

##  (Intercept)    education       income        women 
## -6.794334203  4.186637275  0.001313560 -0.008905157

vc

##               (Intercept)     education        income         women
## (Intercept) 10.4916952589 -0.9787550408  1.151261e-04 -5.548945e-03
## education   -0.9787550408  0.1510886706 -7.290930e-05 -5.107514e-03
## income       0.0001151261 -0.0000729093  7.716239e-08  4.942693e-06
## women       -0.0055489446 -0.0051075141  4.942693e-06  9.245892e-04

Suppose we want to calculate the predicted value of prestige for a job that requires 10 years of education, pays #$10,000 (in 1971 dollars), and is 50% female.

The long-winded way of doing it is:

predictedvalue <- pe[1] + 10 * pe[2] + 10000 * pe[3] + 50 * pe[4]

predictedvalue

## (Intercept) 
##    47.76238

Remember: This is just a way of manually doing what predict() does. A faster way of doing it uses matrix multiplication:

predictedvalue <- c(1, 10, 10000, 50) %*% pe
predictedvalue

##          [,1]
## [1,] 47.76238

Let’s now try to model the uncertainty around this estimate. We can do this by taking random draws from the #distribution of possible beta values that our model has estimated.

library(MASS)
simbetas <- mvrnorm(1000, pe, vc)  # this gives us 1000 estimates of the coefficient estimates
head(simbetas)  # look at it

##      (Intercept) education       income         women
## [1,]   -7.733039  4.139561 0.0014942972  0.0002804371
## [2,]   -3.126209  3.990875 0.0010343540 -0.0282165533
## [3,]   -5.031826  4.199026 0.0012092777 -0.0339291174
## [4,]   -9.528724  4.321738 0.0015280557 -0.0170317620
## [5,]   -1.835650  4.027316 0.0009141161 -0.0202532893
## [6,]  -10.981953  4.853358 0.0011204599 -0.0581665325

Examine the uncertainty around each coefficient (this is using base graphics for quick and dirty data exploration):

plot(density(simbetas[, 2]), col = "blue", lwd = 2, xlim = c(-1, 
  6), ylim = c(0, 20), main = "", xlab = "Coef Estimate")  #initiating a plot with the first density
lines(density(simbetas[, 3]), col = "red", lwd = 2)  # adding next coef
lines(density(simbetas[, 4]), col = "green", lwd = 2)  # adding last coef

What’s going on with the red one? Examine coefficient for income more closely (it’s still normal):

plot(density(simbetas[, 3]), col = "red")

Now let’s calculate the predicted values of prestige for each of the 1000 possible coefficient estimates. We can do this most efficiently using matrix algebra:

xhyp <- c(1, 10, 10000, 50)  # define our scenario: vector of hypothetical x values
predictions <- simbetas %*% xhyp
head(predictions)  # look at it

##          [,1]
## [1,] 48.61957
## [2,] 45.71525
## [3,] 47.35475
## [4,] 48.11763
## [5,] 46.56600
## [6,] 45.84790

Now let’s look at the distribution of predicted values:

hist(predictions)

plot(density(predictions), main = "Predicted Prestige", lwd = 2, 
  col = "blue", xlab = "Prestige", ylab = "Relative Frequency")

We can calculate the mean of this distribution, and the lower and upper 95% confidence intervals like this:

EV <- mean(predictions)
lowerCI <- quantile(predictions, probs = 0.025)
upperCI <- quantile(predictions, probs = 0.975)

Compare the EV to the predictedvalue above:

EV

## [1] 47.87825

predictedvalue  # you'll see that it's roughly the same, but not exactly.  Why not?

##          [,1]
## [1,] 47.76238

You can also compare these estimates to those obtained using the function predict().

predict.lm(foxmodel, newdata = list(education = 10, income = 10000, 
  women = 50), interval = "confidence")

##        fit      lwr      upr
## 1 47.76238 44.29432 51.23045

EV

## [1] 47.87825

lowerCI

##     2.5% 
## 44.69479

upperCI

##    97.5% 
## 51.24598

Let’s add lines to the graph indicating the mean and 95% CIs:

plot(density(predictions), main = "Predicted Prestige", lwd = 2, 
  col = "blue", xlab = "Prestige", ylab = "Relative Frequency")
abline(v = EV, col = "red")
abline(v = c(lowerCI, upperCI), lty = 2)

Now let’s estimate the distribution of prestige values for a job that is the same in every respect but pays $12,000 #instead of $10,000:

xhyp2 <- c(1, 10, 12000, 50)  # define our scenario vector
predictions2 <- simbetas %*% xhyp2  # multiply by the simulated coefficients
plot(density(predictions), main = "Predicted Prestige", lwd = 3, 
  col = "blue", bty = "n", xlab = "Prestige", ylab = "Relative Frequency", 
  xlim = c(40, 60))
lines(density(predictions2), lwd = 3, col = "red")
legend(53, 0.2, c("$10k", "$12k"), lwd = 3, col = c("blue", "red"), 
  cex = 0.8, bty = "n")

Question: Why are we using density plots instead of a histogram? Question: Why does the $12k distribution have greater variation than the $10k one?
Hint: look at the mean of the income variable.

Plotting Expected Values

Suppose we want to examine the effect of changes in income on prestige over a continuous range of values - i.e., not just the $10k and $12k scenarios we looked at above.

Create the range of X for which you want to explore the expectations for y, in a dataframe with placeholders for the results (Note we’re using “length.out” instead of “by” in seq):

scenarios <- data_frame(income = seq(from = min(jobs$income), 
  to = max(jobs$income), length.out = 10), EVs = NA, lowerCIs = NA, 
  upperCIs = NA)

Then get the coefficients and vcmatrix from the model:

pe <- coef(foxmodel)
vc <- vcov(foxmodel)

Take 1000 draws from the multivariate normal distribution:

simbetas <- mvrnorm(1000, pe, vc)

Set up a loop that goes through all 10 values in the income.range vector:

for (i in 1:nrow(scenarios)) {
  
  # define the covariate values for the current scenario
  xhyp <- c(1, 10, scenarios$income[i], 50)  #make sure you understand what these numbers represent
  
  # calculate the 1000 predictions:
  yhyp <- simbetas %*% xhyp
  
  # Now slot the mean, lowerCI and upperCI values into the
  # appropriate positions in the results vectors:
  scenarios$EVs[i] <- mean(yhyp)
  scenarios$lowerCIs[i] <- sort(yhyp)[25]
  scenarios$upperCIs[i] <- sort(yhyp)[975]
  
}  # close i loop

So, now that we’ve got the results, we can look at them :

scenarios

## Source: local data frame [10 x 4]
## 
##       income      EVs lowerCIs upperCIs
##        (dbl)    (dbl)    (dbl)    (dbl)
## 1    611.000 35.42838 32.41529 38.17948
## 2   3418.556 39.09854 37.06306 41.00832
## 3   6226.111 42.76870 40.75085 44.73954
## 4   9033.667 46.43885 43.39157 49.40357
## 5  11841.222 50.10901 45.66956 54.45994
## 6  14648.778 53.77917 48.01691 59.42830
## 7  17456.333 57.44933 50.19847 64.62338
## 8  20263.889 61.11949 52.43290 69.64787
## 9  23071.444 64.78965 54.60657 74.84752
## 10 25879.000 68.45981 56.74642 80.05005

But of course it’s much better to plot them:

ggplot(scenarios, aes(x = income)) + geom_line(aes(y = EVs)) + 
  geom_ribbon(aes(ymin = lowerCIs, ymax = upperCIs), alpha = 1/3) + 
  labs(x = "Income", y = "Expected Prestige", title = "Relationship of Income and Prestige, 50% Women & 10 Years of Education")

Illustrating interaction terms with a first differences plot – including confidence intervals!

foxmodel2 <- lm(prestige ~ education + income * women, data = jobs)
summary(foxmodel2)

## 
## Call:
## lm(formula = prestige ~ education + income * women, data = jobs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4623  -5.1970  -0.4228   5.2365  16.8228 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.760e+00  3.698e+00  -1.017    0.312    
## education     3.883e+00  4.268e-01   9.098 1.19e-14 ***
## income        1.236e-03  2.793e-04   4.426 2.51e-05 ***
## women        -8.692e-02  5.599e-02  -1.552    0.124    
## income:women  2.179e-05  1.318e-05   1.653    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.778 on 97 degrees of freedom
## Multiple R-squared:  0.8037, Adjusted R-squared:  0.7956 
## F-statistic: 99.29 on 4 and 97 DF,  p-value: < 2.2e-16

What do these results mean? Let’s use first differences

Get the coefficients and vcmatrix from the model:

pe2 <- coef(foxmodel2)
vc2 <- vcov(foxmodel2)

Take 1000 draws from the multivariate normal distribution:

simbetas2 <- mvrnorm(1000, pe2, vc2)

Set up range of women employed in that occupation, and define the high and low values of income:

women_range <- seq(from = 0, to = 100, by = 10)
income_hi <- mean(jobs$income) + sd(jobs$income)
income_lo <- mean(jobs$income) - sd(jobs$income)

Create placeholders for the results:

scenarios_women <- data_frame(women = women_range, fd_mean = NA, 
  fd_lowerCI = NA, fd_upperCI = NA)

Set up a loop that goes through all 11 values in the women_range vector:

for (i in 1:nrow(scenarios_women)) {
  
  # define the covariate values for the current scenario
  x.income_hi <- c(1, 10, income_hi, women_range[i], income_hi * 
    women_range[i])
  # make sure you understand where these numbers are coming
  # from
  x.income_lo <- c(1, 10, income_lo, women_range[i], income_lo * 
    women_range[i])
  
  firstdifferences <- (simbetas2 %*% x.income_hi) - (simbetas2 %*% 
    x.income_lo)
  scenarios_women$fd_mean[i] <- mean(firstdifferences)
  scenarios_women$fd_lowerCI[i] <- quantile(firstdifferences, 
    probs = 0.025)
  scenarios_women$fd_upperCI[i] <- quantile(firstdifferences, 
    probs = 0.975)  #why these numbers instead of .5 and .95?
  
}  # close i loop

Examine the results in a table:

scenarios_women

## Source: local data frame [11 x 4]
## 
##    women  fd_mean fd_lowerCI fd_upperCI
##    (dbl)    (dbl)      (dbl)      (dbl)
## 1      0 10.56958   5.743428   15.26759
## 2     10 12.39355   7.797728   17.21062
## 3     20 14.21751   8.613678   19.95399
## 4     30 16.04147   8.989788   23.35442
## 5     40 17.86543   9.016134   27.09106
## 6     50 19.68939   8.726005   30.69773
## 7     60 21.51335   8.609789   34.31999
## 8     70 23.33731   8.135395   38.35044
## 9     80 25.16127   7.935264   42.36954
## 10    90 26.98523   7.603087   46.18401
## 11   100 28.80919   7.402275   50.09452

But of course it’s much better to plot them:

ggplot(scenarios_women, aes(x = women)) + geom_line(aes(y = fd_mean)) + 
  geom_ribbon(aes(ymin = fd_lowerCI, ymax = fd_upperCI), alpha = 1/3) + 
  labs(x = "Percentage of Individuals in an Occupation who are Women", 
    y = "Expected Increase in Prestige", title = "Effect of a change in income from 1 s.d. below to 1 s.d. above mean")

Note: Chris has a package called simcf that does this and integrates well with his package tile for plotting - more detail on this later

Challenge

Return to the first model we ran on the occupational prestige data.

foxmodel <- lm(prestige ~ education + income + women, data = jobs)
summary(foxmodel)

## 
## Call:
## lm(formula = prestige ~ education + income + women, data = jobs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8246  -5.3332  -0.1364   5.1587  17.5045 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.7943342  3.2390886  -2.098   0.0385 *  
## education    4.1866373  0.3887013  10.771  < 2e-16 ***
## income       0.0013136  0.0002778   4.729 7.58e-06 ***
## women       -0.0089052  0.0304071  -0.293   0.7702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.846 on 98 degrees of freedom
## Multiple R-squared:  0.7982, Adjusted R-squared:  0.792 
## F-statistic: 129.2 on 3 and 98 DF,  p-value: < 2.2e-16

For each independent variable in the model, calculate the expected change in prestige that results from a change in that independent variable from one SD below its mean to one SD above its mean, along with 95 percent confidence intervals around this quantity.

Sources

Oil data from : Ross 2001.
http://staff.washington.edu/csjohns/503/lab5.r
http://staff.washington.edu/csjohns/503/lab6.r
http://staff.washington.edu/csjohns/503/lab7.r

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. R code is licensed under a BSD 2-clause license.

Lab 7 - More Regression Interpretation

Carolina Johnson, Jeffrey B. Arnold

Friday May 15, 2015

More on Interpretation

First Differences

Transformations

Post-estimation Diagnostics

Robust standard errors

Using simulation to compute expected values and confidence intervals:

Plotting Expected Values

Illustrating interaction terms with a first differences plot – including confidence intervals!

Sources