POLS/CSSS 503, University of Washington, Spring 2015
For this lab we will use the replication data from Michael Ross’s “The Oil Curse: How Petroleum Wealth Shapes the Development of Nations.”
We will be exploring the relationship between oil dependency and democracy.
This lab will use some libraries you’ve seen before and we should load them now
library(dplyr)
library(ggplot2)
library(broom)
Read in the Ross data
rossdata <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/ross_2012.csv",
                     stringsAsFactors = FALSE)
head (rossdata)
glimpse(rossdata)
This is a pretty big data-set, we do not need all the variables let’s subset our data to include only the variables we will use.
Create a new data-set containing only:
ctyyearpolitylogoil_gdpcap2000_sup_Q_1logGDPcapoecdCall this new data-frame data
data<-rossdata %>% 
  tbl_df() %>% 
  select(cty, year, polity, logoil_gdpcap2000_sup_Q_1, logGDPcap, oecd)
Note that I am putting this new data-frame in a tbl_df. You don’t have to do it but let’s try to use as much dplyr as possible.
Some of those names are too long, we should change it to something meaningful and short. This can be done easily using dplyr and rename.
data<-data %>%
  rename(oil=logoil_gdpcap2000_sup_Q_1, gdp=logGDPcap)
This data-frame is way easier to glimpse at:
glimpse(data)
## Observations: 8,523
## Variables: 6
## $ cty    (chr) "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist...
## $ year   (int) 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1...
## $ polity (dbl) 1.00, 1.00, 1.00, 1.00, 2.35, 2.35, 2.35, 2.35, 2.35, 2...
## $ oil    (dbl) NA, NA, NA, NA, NA, NA, 0.000000, 0.000000, 3.171584, 4...
## $ gdp    (dbl) NA, NA, NA, NA, NA, 4.860139, 4.848679, 4.854403, 4.865...
## $ oecd   (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
data
## Source: local data frame [8,523 x 6]
## 
##            cty  year polity      oil      gdp  oecd
##          (chr) (int)  (dbl)    (dbl)    (dbl) (int)
## 1  Afghanistan  1960   1.00       NA       NA     0
## 2  Afghanistan  1961   1.00       NA       NA     0
## 3  Afghanistan  1962   1.00       NA       NA     0
## 4  Afghanistan  1963   1.00       NA       NA     0
## 5  Afghanistan  1964   2.35       NA       NA     0
## 6  Afghanistan  1965   2.35       NA 4.860139     0
## 7  Afghanistan  1966   2.35 0.000000 4.848679     0
## 8  Afghanistan  1967   2.35 0.000000 4.854403     0
## 9  Afghanistan  1968   2.35 3.171584 4.865864     0
## 10 Afghanistan  1969   2.35 4.154883 4.858425     0
## ..         ...   ...    ...      ...      ...   ...
A lot of missing values here. Let’s omit them and then look at a summary of the data set
data<-na.omit(data)
data %>%
  summary()
##      cty                 year          polity            oil          
##  Length:5609        Min.   :1961   Min.   : 1.000   Min.   : 0.00000  
##  Class :character   1st Qu.:1975   1st Qu.: 2.350   1st Qu.: 0.00000  
##  Mode  :character   Median :1987   Median : 5.950   Median : 0.06179  
##                     Mean   :1985   Mean   : 5.953   Mean   : 2.51778  
##                     3rd Qu.:1996   3rd Qu.: 9.550   3rd Qu.: 5.06456  
##                     Max.   :2005   Max.   :10.000   Max.   :11.00076  
##       gdp              oecd       
##  Min.   : 4.035   Min.   :0.0000  
##  1st Qu.: 6.059   1st Qu.:0.0000  
##  Median : 7.283   Median :0.0000  
##  Mean   : 7.424   Mean   :0.1756  
##  3rd Qu.: 8.712   3rd Qu.:0.0000  
##  Max.   :10.908   Max.   :1.0000
Finally! we are ready to start data-analyzing..
We are going to be exploring the relationship between Democracy level (polity) and other covariates.
Let’s explore these relationships with plots.
Create a plot that explores the relationship between democracy level and at least another variable but try to include more than two covariates using different colors and shapes.
We begin simple…
ggplot(data, aes(x = gdp, y = polity)) +
  geom_point(position = position_jitter(height = .5),  size = 3) + 
  theme_bw() 
Unfortunately, a simple scatter plot makes it hard to detect any relationship. However, ggplot2 makes it easy to add different colors and shapes which might help identify trends.
ggplot(data, aes(x = gdp, y = polity, colour = oil, shape=factor(oecd))) +
  geom_point(position = position_jitter(height = .5),  size = 3) + 
  theme_bw() 
ggplot(data, aes(x = gdp, y = polity, colour = factor(oecd))) +
  geom_point(aes(size=(oil)), position = position_jitter(height = .5)) + 
  theme_bw() 
It seems like the higher the GDP the more democratic countries are, except if you are a high oil producer or a non-OECD. Let’s explore these relationships using a regression.
model1<-lm(polity ~ oil, data=data)
summary(model1)
## 
## Call:
## lm(formula = polity ~ oil, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1436 -3.4863  0.1319  3.5112  4.5537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.14361    0.05826 105.451  < 2e-16 ***
## oil         -0.07556    0.01468  -5.149 2.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.373 on 5607 degrees of freedom
## Multiple R-squared:  0.004706,   Adjusted R-squared:  0.004528 
## F-statistic: 26.51 on 1 and 5607 DF,  p-value: 2.711e-07
Let’s now include controls for GDP per capita and OECD membership
model2<-lm(polity ~ gdp + oil + oecd, data=data)
summary(model2)
## 
## Call:
## lm(formula = polity ~ gdp + oil + oecd, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9354 -2.2510 -0.1076  2.4055  6.1209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.26692    0.22778  -1.172    0.241    
## gdp          0.86570    0.03386  25.564   <2e-16 ***
## oil         -0.23266    0.01313 -17.724   <2e-16 ***
## oecd         2.15876    0.13248  16.294   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.758 on 5605 degrees of freedom
## Multiple R-squared:  0.3349, Adjusted R-squared:  0.3345 
## F-statistic: 940.7 on 3 and 5605 DF,  p-value: < 2.2e-16
Recall the OECD membership clustering? Let’s try an interaction
model3<-lm(polity ~ gdp + oil*oecd, data=data)
summary(model3)
## 
## Call:
## lm(formula = polity ~ gdp + oil * oecd, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0258 -2.2644 -0.0124  2.2895  6.3659 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.43621    0.22621  -1.928   0.0539 .  
## gdp          0.90989    0.03381  26.912  < 2e-16 ***
## oil         -0.28858    0.01407 -20.515  < 2e-16 ***
## oecd         1.03588    0.16981   6.100 1.13e-09 ***
## oil:oecd     0.36751    0.03527  10.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.732 on 5604 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.347 
## F-statistic: 746.2 on 4 and 5604 DF,  p-value: < 2.2e-16
How would you interpret these results?
broom has three main functions, all of which return data frames (not lists, numeric vectors, or other types of object). glance returns a data frame with a single row summary of the model:
glance(model2)
##   r.squared adj.r.squared    sigma statistic p.value df    logLik      AIC
## 1 0.3348738     0.3345178 2.758247  940.6576       0  4 -13647.69 27305.38
##        BIC deviance df.residual
## 1 27338.54 42642.44        5605
tidy returns a data frame with a row for each coefficient estimate:
tidy(model2)
##          term   estimate  std.error  statistic       p.value
## 1 (Intercept) -0.2669218 0.22777880  -1.171847  2.413085e-01
## 2         gdp  0.8656966 0.03386326  25.564478 1.925632e-136
## 3         oil -0.2326598 0.01312701 -17.723744  1.968195e-68
## 4        oecd  2.1587643 0.13248438  16.294482  2.330375e-58
augment returns the original data frame used in the model with additional columns for fitted values, the standard errors of those fitted values, residuals, etc.
head(augment(model2))
##   polity      gdp      oil oecd  .fitted    .se.fit     .resid
## 1   2.35 4.848679 0.000000    0 3.930563 0.07669619 -1.5805627
## 2   2.35 4.854403 0.000000    0 3.935518 0.07654974 -1.5855183
## 3   2.35 4.865864 3.171584    0 3.207540 0.08501127 -0.8575396
## 4   2.35 4.858425 4.154883    0 2.972326 0.09169896 -0.6223259
## 5   2.35 4.854891 4.447948    0 2.901082 0.09398764 -0.5510817
## 6   2.35 4.785476 4.600730    0 2.805443 0.09720522 -0.4554436
##           .hat   .sigma      .cooksd .std.resid
## 1 0.0007731810 2.758413 6.356969e-05 -0.5732532
## 2 0.0007702310 2.758412 6.372449e-05 -0.5750497
## 3 0.0009499189 2.758470 2.299821e-05 -0.3110480
## 4 0.0011052547 2.758481 1.409718e-05 -0.2257485
## 5 0.0011611145 2.758484 1.161421e-05 -0.1999102
## 6 0.0012419747 2.758487 8.486624e-06 -0.1652233
How about a coefficient plot, roppeladder… etc.
ggplot(tidy(model2) %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, 
           ymin = estimate - 2 * std.error, 
           ymax = estimate + 2 * std.error)) + 
  geom_pointrange() + 
  coord_flip()
We can also use coefplot
library("coefplot")
coefplot(model2, coefficients = c("oecd", "oil", "gdp"))
Several packages (stargazer, texreg, apsrtable) are useful for creating publication type regression tables. stargazer and texreg are the most complete package. Both allow output to either LaTeX or HTML tables for many types of statistical models. We’ll use stargazer here:
library(stargazer)
stargazer(model1, model2, model3, type = "html")
| Dependent variable: | |||
| polity | |||
| (1) | (2) | (3) | |
| gdp | 0.866*** | 0.910*** | |
| (0.034) | (0.034) | ||
| oil | -0.076*** | -0.233*** | -0.289*** | 
| (0.015) | (0.013) | (0.014) | |
| oecd | 2.159*** | 1.036*** | |
| (0.132) | (0.170) | ||
| oil:oecd | 0.368*** | ||
| (0.035) | |||
| Constant | 6.144*** | -0.267 | -0.436* | 
| (0.058) | (0.228) | (0.226) | |
| Observations | 5,609 | 5,609 | 5,609 | 
| R2 | 0.005 | 0.335 | 0.348 | 
| Adjusted R2 | 0.005 | 0.335 | 0.347 | 
| Residual Std. Error | 3.373 (df = 5607) | 2.758 (df = 5605) | 2.732 (df = 5604) | 
| F Statistic | 26.510*** (df = 1; 5607) | 940.658*** (df = 3; 5605) | 746.175*** (df = 4; 5604) | 
| Note: | p<0.1; p<0.05; p<0.01 | ||
We are going to use predict to get predicted values. We first have to set up a newdata
xnew <- list(gdp=5.9, oil=0, oecd=1)
predict(model2, newdata=xnew, interval="confidence")
##        fit      lwr      upr
## 1 6.999452 6.711659 7.287246
xnew2 <- list(gdp=5.9, oil=0, oecd=0)
predict(model2, newdata=xnew2, interval="confidence")
##        fit      lwr      upr
## 1 4.840688 4.732938 4.948437
What is this really doing?
model2
## 
## Call:
## lm(formula = polity ~ gdp + oil + oecd, data = data)
## 
## Coefficients:
## (Intercept)          gdp          oil         oecd  
##     -0.2669       0.8657      -0.2327       2.1588
names(model2)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
pe2<-model2$coefficients
pe2
## (Intercept)         gdp         oil        oecd 
##  -0.2669218   0.8656966  -0.2326598   2.1587643
1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 1*pe2[4]
## (Intercept) 
##    6.999452
1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 0*pe2[4]
## (Intercept) 
##    4.840688
We can create a matrix of hypothetical data to obtain predictions for a range of values:
# create a vector of hypothetical values of GDP per capita
gdp.hyp <-seq(4,11,by=1)
#create a matrix containing all hypothetical values, which are constant for the other covariates:
xnew <- list(gdp=gdp.hyp, oil=rep(0, length(gdp.hyp)), oecd=rep(1, length(gdp.hyp)))
xnew
## $gdp
## [1]  4  5  6  7  8  9 10 11
## 
## $oil
## [1] 0 0 0 0 0 0 0 0
## 
## $oecd
## [1] 1 1 1 1 1 1 1 1
Now we feed this new data into predict
pred.res <- predict(model2, newdata=xnew, interval="confidence")
pred.res
##         fit       lwr       upr
## 1  5.354629  4.961562  5.747695
## 2  6.220325  5.884452  6.556198
## 3  7.086022  6.803296  7.368748
## 4  7.951718  7.715347  8.188090
## 5  8.817415  8.615865  9.018965
## 6  9.683112  9.498212  9.868011
## 7 10.548808 10.357583 10.740033
## 8 11.414505 11.195964 11.633045
To plot these predicted values we have to create a data frame containing both the predicted values generated by predict and the data used to generate those values
mod2_predicted <-as.data.frame(pred.res)
mod2_pred_df <- cbind(xnew, mod2_predicted)
mod2_pred_df
##   gdp oil oecd       fit       lwr       upr
## 1   4   0    1  5.354629  4.961562  5.747695
## 2   5   0    1  6.220325  5.884452  6.556198
## 3   6   0    1  7.086022  6.803296  7.368748
## 4   7   0    1  7.951718  7.715347  8.188090
## 5   8   0    1  8.817415  8.615865  9.018965
## 6   9   0    1  9.683112  9.498212  9.868011
## 7  10   0    1 10.548808 10.357583 10.740033
## 8  11   0    1 11.414505 11.195964 11.633045
We have now have a data-frame that can easily be taken by ggplot
ggplot() +
  geom_line(data = mod2_pred_df, mapping = aes(x = gdp, y = fit)) +
  geom_ribbon(data = mod2_pred_df, mapping = aes(x = gdp, ymin = lwr, ymax = upr),
              alpha = 0.2)+
  ylab("Democracy")+
  theme_bw()
Now the model with the interaction.
summary(model3)
## 
## Call:
## lm(formula = polity ~ gdp + oil * oecd, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0258 -2.2644 -0.0124  2.2895  6.3659 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.43621    0.22621  -1.928   0.0539 .  
## gdp          0.90989    0.03381  26.912  < 2e-16 ***
## oil         -0.28858    0.01407 -20.515  < 2e-16 ***
## oecd         1.03588    0.16981   6.100 1.13e-09 ***
## oil:oecd     0.36751    0.03527  10.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.732 on 5604 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.347 
## F-statistic: 746.2 on 4 and 5604 DF,  p-value: < 2.2e-16
oil.hyp <- seq(0,11,by=1)
xnew1 <- list(oil=oil.hyp, gdp=rep(mean(data$gdp), length(oil.hyp)), oecd=rep(0, length(oil.hyp)))
xnew1
## $oil
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11
## 
## $gdp
##  [1] 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049
##  [8] 7.424049 7.424049 7.424049 7.424049 7.424049
## 
## $oecd
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0
pred.res1 <- predict(model3, newdata=xnew1, interval="confidence")
pred.res1
##         fit      lwr      upr
## 1  6.318855 6.202172 6.435538
## 2  6.030278 5.930781 6.129776
## 3  5.741702 5.653920 5.829483
## 4  5.453125 5.369266 5.536984
## 5  5.164548 5.075779 5.253317
## 6  4.875971 4.774737 4.977206
## 7  4.587395 4.468492 4.706298
## 8  4.298818 4.159003 4.438633
## 9  4.010241 3.847516 4.172966
## 10 3.721665 3.534765 3.908564
## 11 3.433088 3.221181 3.644995
## 12 3.144511 2.907028 3.381995
xnew2 <- list(oil=oil.hyp, gdp=rep(mean(data$gdp), length(oil.hyp)), oecd=rep(1, length(oil.hyp)))
xnew2
## $oil
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11
## 
## $gdp
##  [1] 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049
##  [8] 7.424049 7.424049 7.424049 7.424049 7.424049
## 
## $oecd
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1
pred.res2 <- predict(model3, newdata=xnew2, interval="confidence")
How would you construct this data-frame to be used with ggplot?
mod3_predicted_1 <- as.data.frame(pred.res1)
mod3_pred_df_1  <- cbind(xnew1, mod3_predicted_1)
mod3_predicted_2 <- as.data.frame(pred.res2)
mod3_pred_df_2 <- cbind(xnew2, mod3_predicted_2)
mod3_pred_df <- bind_rows(mod3_pred_df_1,mod3_pred_df_2)
mod3_pred_df
## Source: local data frame [24 x 6]
## 
##      oil      gdp  oecd      fit      lwr      upr
##    (dbl)    (dbl) (dbl)    (dbl)    (dbl)    (dbl)
## 1      0 7.424049     0 6.318855 6.202172 6.435538
## 2      1 7.424049     0 6.030278 5.930781 6.129776
## 3      2 7.424049     0 5.741702 5.653920 5.829483
## 4      3 7.424049     0 5.453125 5.369266 5.536984
## 5      4 7.424049     0 5.164548 5.075779 5.253317
## 6      5 7.424049     0 4.875971 4.774737 4.977206
## 7      6 7.424049     0 4.587395 4.468492 4.706298
## 8      7 7.424049     0 4.298818 4.159003 4.438633
## 9      8 7.424049     0 4.010241 3.847516 4.172966
## 10     9 7.424049     0 3.721665 3.534765 3.908564
## ..   ...      ...   ...      ...      ...      ...
Now we can ggplot it
ggplot(mod3_pred_df, aes(x =oil , y = fit, 
                         ymin = lwr, ymax = upr)) +
  geom_line(mapping = aes(colour = factor(oecd))) +
  geom_ribbon(mapping = aes(fill = factor(oecd)), alpha = 0.7) +
  ylab("Democracy") + 
  ggtitle("Predicted values of Democracy by GDP and OECD membership")+
  scale_fill_discrete("OECD")+
  scale_colour_discrete("OECD")+
  theme_bw()