POLS/CSSS 503, University of Washington, Spring 2015
For this lab we will use the replication data from Michael Ross’s “The Oil Curse: How Petroleum Wealth Shapes the Development of Nations.”
We will be exploring the relationship between oil dependency and democracy.
This lab will use some libraries you’ve seen before and we should load them now
library(dplyr)
library(ggplot2)
library(broom)
Read in the Ross data
rossdata <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/ross_2012.csv",
stringsAsFactors = FALSE)
head (rossdata)
glimpse(rossdata)
This is a pretty big data-set, we do not need all the variables let’s subset our data to include only the variables we will use.
Create a new data-set containing only:
cty
year
polity
logoil_gdpcap2000_sup_Q_1
logGDPcap
oecd
Call this new data-frame data
data<-rossdata %>%
tbl_df() %>%
select(cty, year, polity, logoil_gdpcap2000_sup_Q_1, logGDPcap, oecd)
Note that I am putting this new data-frame in a tbl_df
. You don’t have to do it but let’s try to use as much dplyr
as possible.
Some of those names are too long, we should change it to something meaningful and short. This can be done easily using dplyr
and rename
.
data<-data %>%
rename(oil=logoil_gdpcap2000_sup_Q_1, gdp=logGDPcap)
This data-frame is way easier to glimpse at:
glimpse(data)
## Observations: 8,523
## Variables: 6
## $ cty (chr) "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist...
## $ year (int) 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1...
## $ polity (dbl) 1.00, 1.00, 1.00, 1.00, 2.35, 2.35, 2.35, 2.35, 2.35, 2...
## $ oil (dbl) NA, NA, NA, NA, NA, NA, 0.000000, 0.000000, 3.171584, 4...
## $ gdp (dbl) NA, NA, NA, NA, NA, 4.860139, 4.848679, 4.854403, 4.865...
## $ oecd (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
data
## Source: local data frame [8,523 x 6]
##
## cty year polity oil gdp oecd
## (chr) (int) (dbl) (dbl) (dbl) (int)
## 1 Afghanistan 1960 1.00 NA NA 0
## 2 Afghanistan 1961 1.00 NA NA 0
## 3 Afghanistan 1962 1.00 NA NA 0
## 4 Afghanistan 1963 1.00 NA NA 0
## 5 Afghanistan 1964 2.35 NA NA 0
## 6 Afghanistan 1965 2.35 NA 4.860139 0
## 7 Afghanistan 1966 2.35 0.000000 4.848679 0
## 8 Afghanistan 1967 2.35 0.000000 4.854403 0
## 9 Afghanistan 1968 2.35 3.171584 4.865864 0
## 10 Afghanistan 1969 2.35 4.154883 4.858425 0
## .. ... ... ... ... ... ...
A lot of missing values here. Let’s omit them and then look at a summary of the data set
data<-na.omit(data)
data %>%
summary()
## cty year polity oil
## Length:5609 Min. :1961 Min. : 1.000 Min. : 0.00000
## Class :character 1st Qu.:1975 1st Qu.: 2.350 1st Qu.: 0.00000
## Mode :character Median :1987 Median : 5.950 Median : 0.06179
## Mean :1985 Mean : 5.953 Mean : 2.51778
## 3rd Qu.:1996 3rd Qu.: 9.550 3rd Qu.: 5.06456
## Max. :2005 Max. :10.000 Max. :11.00076
## gdp oecd
## Min. : 4.035 Min. :0.0000
## 1st Qu.: 6.059 1st Qu.:0.0000
## Median : 7.283 Median :0.0000
## Mean : 7.424 Mean :0.1756
## 3rd Qu.: 8.712 3rd Qu.:0.0000
## Max. :10.908 Max. :1.0000
Finally! we are ready to start data-analyzing..
We are going to be exploring the relationship between Democracy level (polity) and other covariates.
Let’s explore these relationships with plots.
Create a plot that explores the relationship between democracy level and at least another variable but try to include more than two covariates using different colors and shapes.
We begin simple…
ggplot(data, aes(x = gdp, y = polity)) +
geom_point(position = position_jitter(height = .5), size = 3) +
theme_bw()
Unfortunately, a simple scatter plot makes it hard to detect any relationship. However, ggplot2
makes it easy to add different colors and shapes which might help identify trends.
ggplot(data, aes(x = gdp, y = polity, colour = oil, shape=factor(oecd))) +
geom_point(position = position_jitter(height = .5), size = 3) +
theme_bw()
ggplot(data, aes(x = gdp, y = polity, colour = factor(oecd))) +
geom_point(aes(size=(oil)), position = position_jitter(height = .5)) +
theme_bw()
It seems like the higher the GDP the more democratic countries are, except if you are a high oil producer or a non-OECD. Let’s explore these relationships using a regression.
model1<-lm(polity ~ oil, data=data)
summary(model1)
##
## Call:
## lm(formula = polity ~ oil, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1436 -3.4863 0.1319 3.5112 4.5537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.14361 0.05826 105.451 < 2e-16 ***
## oil -0.07556 0.01468 -5.149 2.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.373 on 5607 degrees of freedom
## Multiple R-squared: 0.004706, Adjusted R-squared: 0.004528
## F-statistic: 26.51 on 1 and 5607 DF, p-value: 2.711e-07
Let’s now include controls for GDP per capita and OECD membership
model2<-lm(polity ~ gdp + oil + oecd, data=data)
summary(model2)
##
## Call:
## lm(formula = polity ~ gdp + oil + oecd, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9354 -2.2510 -0.1076 2.4055 6.1209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26692 0.22778 -1.172 0.241
## gdp 0.86570 0.03386 25.564 <2e-16 ***
## oil -0.23266 0.01313 -17.724 <2e-16 ***
## oecd 2.15876 0.13248 16.294 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.758 on 5605 degrees of freedom
## Multiple R-squared: 0.3349, Adjusted R-squared: 0.3345
## F-statistic: 940.7 on 3 and 5605 DF, p-value: < 2.2e-16
Recall the OECD membership clustering? Let’s try an interaction
model3<-lm(polity ~ gdp + oil*oecd, data=data)
summary(model3)
##
## Call:
## lm(formula = polity ~ gdp + oil * oecd, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0258 -2.2644 -0.0124 2.2895 6.3659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.43621 0.22621 -1.928 0.0539 .
## gdp 0.90989 0.03381 26.912 < 2e-16 ***
## oil -0.28858 0.01407 -20.515 < 2e-16 ***
## oecd 1.03588 0.16981 6.100 1.13e-09 ***
## oil:oecd 0.36751 0.03527 10.420 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.732 on 5604 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.347
## F-statistic: 746.2 on 4 and 5604 DF, p-value: < 2.2e-16
How would you interpret these results?
broom
has three main functions, all of which return data frames (not lists, numeric vectors, or other types of object). glance
returns a data frame with a single row summary of the model:
glance(model2)
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## 1 0.3348738 0.3345178 2.758247 940.6576 0 4 -13647.69 27305.38
## BIC deviance df.residual
## 1 27338.54 42642.44 5605
tidy
returns a data frame with a row for each coefficient estimate:
tidy(model2)
## term estimate std.error statistic p.value
## 1 (Intercept) -0.2669218 0.22777880 -1.171847 2.413085e-01
## 2 gdp 0.8656966 0.03386326 25.564478 1.925632e-136
## 3 oil -0.2326598 0.01312701 -17.723744 1.968195e-68
## 4 oecd 2.1587643 0.13248438 16.294482 2.330375e-58
augment
returns the original data frame used in the model with additional columns for fitted values, the standard errors of those fitted values, residuals, etc.
head(augment(model2))
## polity gdp oil oecd .fitted .se.fit .resid
## 1 2.35 4.848679 0.000000 0 3.930563 0.07669619 -1.5805627
## 2 2.35 4.854403 0.000000 0 3.935518 0.07654974 -1.5855183
## 3 2.35 4.865864 3.171584 0 3.207540 0.08501127 -0.8575396
## 4 2.35 4.858425 4.154883 0 2.972326 0.09169896 -0.6223259
## 5 2.35 4.854891 4.447948 0 2.901082 0.09398764 -0.5510817
## 6 2.35 4.785476 4.600730 0 2.805443 0.09720522 -0.4554436
## .hat .sigma .cooksd .std.resid
## 1 0.0007731810 2.758413 6.356969e-05 -0.5732532
## 2 0.0007702310 2.758412 6.372449e-05 -0.5750497
## 3 0.0009499189 2.758470 2.299821e-05 -0.3110480
## 4 0.0011052547 2.758481 1.409718e-05 -0.2257485
## 5 0.0011611145 2.758484 1.161421e-05 -0.1999102
## 6 0.0012419747 2.758487 8.486624e-06 -0.1652233
How about a coefficient plot, roppeladder… etc.
ggplot(tidy(model2) %>% filter(term != "(Intercept)"),
aes(x = term, y = estimate,
ymin = estimate - 2 * std.error,
ymax = estimate + 2 * std.error)) +
geom_pointrange() +
coord_flip()
We can also use coefplot
library("coefplot")
coefplot(model2, coefficients = c("oecd", "oil", "gdp"))
Several packages (stargazer
, texreg
, apsrtable
) are useful for creating publication type regression tables. stargazer
and texreg
are the most complete package. Both allow output to either LaTeX or HTML tables for many types of statistical models. We’ll use stargazer here:
library(stargazer)
stargazer(model1, model2, model3, type = "html")
Dependent variable: | |||
polity | |||
(1) | (2) | (3) | |
gdp | 0.866*** | 0.910*** | |
(0.034) | (0.034) | ||
oil | -0.076*** | -0.233*** | -0.289*** |
(0.015) | (0.013) | (0.014) | |
oecd | 2.159*** | 1.036*** | |
(0.132) | (0.170) | ||
oil:oecd | 0.368*** | ||
(0.035) | |||
Constant | 6.144*** | -0.267 | -0.436* |
(0.058) | (0.228) | (0.226) | |
Observations | 5,609 | 5,609 | 5,609 |
R2 | 0.005 | 0.335 | 0.348 |
Adjusted R2 | 0.005 | 0.335 | 0.347 |
Residual Std. Error | 3.373 (df = 5607) | 2.758 (df = 5605) | 2.732 (df = 5604) |
F Statistic | 26.510*** (df = 1; 5607) | 940.658*** (df = 3; 5605) | 746.175*** (df = 4; 5604) |
Note: | p<0.1; p<0.05; p<0.01 |
We are going to use predict
to get predicted values. We first have to set up a newdata
xnew <- list(gdp=5.9, oil=0, oecd=1)
predict(model2, newdata=xnew, interval="confidence")
## fit lwr upr
## 1 6.999452 6.711659 7.287246
xnew2 <- list(gdp=5.9, oil=0, oecd=0)
predict(model2, newdata=xnew2, interval="confidence")
## fit lwr upr
## 1 4.840688 4.732938 4.948437
What is this really doing?
model2
##
## Call:
## lm(formula = polity ~ gdp + oil + oecd, data = data)
##
## Coefficients:
## (Intercept) gdp oil oecd
## -0.2669 0.8657 -0.2327 2.1588
names(model2)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
pe2<-model2$coefficients
pe2
## (Intercept) gdp oil oecd
## -0.2669218 0.8656966 -0.2326598 2.1587643
1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 1*pe2[4]
## (Intercept)
## 6.999452
1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 0*pe2[4]
## (Intercept)
## 4.840688
We can create a matrix of hypothetical data to obtain predictions for a range of values:
# create a vector of hypothetical values of GDP per capita
gdp.hyp <-seq(4,11,by=1)
#create a matrix containing all hypothetical values, which are constant for the other covariates:
xnew <- list(gdp=gdp.hyp, oil=rep(0, length(gdp.hyp)), oecd=rep(1, length(gdp.hyp)))
xnew
## $gdp
## [1] 4 5 6 7 8 9 10 11
##
## $oil
## [1] 0 0 0 0 0 0 0 0
##
## $oecd
## [1] 1 1 1 1 1 1 1 1
Now we feed this new data into predict
pred.res <- predict(model2, newdata=xnew, interval="confidence")
pred.res
## fit lwr upr
## 1 5.354629 4.961562 5.747695
## 2 6.220325 5.884452 6.556198
## 3 7.086022 6.803296 7.368748
## 4 7.951718 7.715347 8.188090
## 5 8.817415 8.615865 9.018965
## 6 9.683112 9.498212 9.868011
## 7 10.548808 10.357583 10.740033
## 8 11.414505 11.195964 11.633045
To plot these predicted values we have to create a data frame containing both the predicted values generated by predict and the data used to generate those values
mod2_predicted <-as.data.frame(pred.res)
mod2_pred_df <- cbind(xnew, mod2_predicted)
mod2_pred_df
## gdp oil oecd fit lwr upr
## 1 4 0 1 5.354629 4.961562 5.747695
## 2 5 0 1 6.220325 5.884452 6.556198
## 3 6 0 1 7.086022 6.803296 7.368748
## 4 7 0 1 7.951718 7.715347 8.188090
## 5 8 0 1 8.817415 8.615865 9.018965
## 6 9 0 1 9.683112 9.498212 9.868011
## 7 10 0 1 10.548808 10.357583 10.740033
## 8 11 0 1 11.414505 11.195964 11.633045
We have now have a data-frame that can easily be taken by ggplot
ggplot() +
geom_line(data = mod2_pred_df, mapping = aes(x = gdp, y = fit)) +
geom_ribbon(data = mod2_pred_df, mapping = aes(x = gdp, ymin = lwr, ymax = upr),
alpha = 0.2)+
ylab("Democracy")+
theme_bw()
Now the model with the interaction.
summary(model3)
##
## Call:
## lm(formula = polity ~ gdp + oil * oecd, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0258 -2.2644 -0.0124 2.2895 6.3659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.43621 0.22621 -1.928 0.0539 .
## gdp 0.90989 0.03381 26.912 < 2e-16 ***
## oil -0.28858 0.01407 -20.515 < 2e-16 ***
## oecd 1.03588 0.16981 6.100 1.13e-09 ***
## oil:oecd 0.36751 0.03527 10.420 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.732 on 5604 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.347
## F-statistic: 746.2 on 4 and 5604 DF, p-value: < 2.2e-16
oil.hyp <- seq(0,11,by=1)
xnew1 <- list(oil=oil.hyp, gdp=rep(mean(data$gdp), length(oil.hyp)), oecd=rep(0, length(oil.hyp)))
xnew1
## $oil
## [1] 0 1 2 3 4 5 6 7 8 9 10 11
##
## $gdp
## [1] 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049
## [8] 7.424049 7.424049 7.424049 7.424049 7.424049
##
## $oecd
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
pred.res1 <- predict(model3, newdata=xnew1, interval="confidence")
pred.res1
## fit lwr upr
## 1 6.318855 6.202172 6.435538
## 2 6.030278 5.930781 6.129776
## 3 5.741702 5.653920 5.829483
## 4 5.453125 5.369266 5.536984
## 5 5.164548 5.075779 5.253317
## 6 4.875971 4.774737 4.977206
## 7 4.587395 4.468492 4.706298
## 8 4.298818 4.159003 4.438633
## 9 4.010241 3.847516 4.172966
## 10 3.721665 3.534765 3.908564
## 11 3.433088 3.221181 3.644995
## 12 3.144511 2.907028 3.381995
xnew2 <- list(oil=oil.hyp, gdp=rep(mean(data$gdp), length(oil.hyp)), oecd=rep(1, length(oil.hyp)))
xnew2
## $oil
## [1] 0 1 2 3 4 5 6 7 8 9 10 11
##
## $gdp
## [1] 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049 7.424049
## [8] 7.424049 7.424049 7.424049 7.424049 7.424049
##
## $oecd
## [1] 1 1 1 1 1 1 1 1 1 1 1 1
pred.res2 <- predict(model3, newdata=xnew2, interval="confidence")
How would you construct this data-frame to be used with ggplot
?
mod3_predicted_1 <- as.data.frame(pred.res1)
mod3_pred_df_1 <- cbind(xnew1, mod3_predicted_1)
mod3_predicted_2 <- as.data.frame(pred.res2)
mod3_pred_df_2 <- cbind(xnew2, mod3_predicted_2)
mod3_pred_df <- bind_rows(mod3_pred_df_1,mod3_pred_df_2)
mod3_pred_df
## Source: local data frame [24 x 6]
##
## oil gdp oecd fit lwr upr
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 0 7.424049 0 6.318855 6.202172 6.435538
## 2 1 7.424049 0 6.030278 5.930781 6.129776
## 3 2 7.424049 0 5.741702 5.653920 5.829483
## 4 3 7.424049 0 5.453125 5.369266 5.536984
## 5 4 7.424049 0 5.164548 5.075779 5.253317
## 6 5 7.424049 0 4.875971 4.774737 4.977206
## 7 6 7.424049 0 4.587395 4.468492 4.706298
## 8 7 7.424049 0 4.298818 4.159003 4.438633
## 9 8 7.424049 0 4.010241 3.847516 4.172966
## 10 9 7.424049 0 3.721665 3.534765 3.908564
## .. ... ... ... ... ... ...
Now we can ggplot it
ggplot(mod3_pred_df, aes(x =oil , y = fit,
ymin = lwr, ymax = upr)) +
geom_line(mapping = aes(colour = factor(oecd))) +
geom_ribbon(mapping = aes(fill = factor(oecd)), alpha = 0.7) +
ylab("Democracy") +
ggtitle("Predicted values of Democracy by GDP and OECD membership")+
scale_fill_discrete("OECD")+
scale_colour_discrete("OECD")+
theme_bw()