POLS/CSSS 503, University of Washington, Spring 2015
\[ \DeclareMathOperator{\logit}{logit} \]
hw4
and load that project.Submit a zipped file of the directory with your R project through Canvas. This should contain all the materials for another person to run your R Markdown file This should contain:
.Rproj
) file..Rmd
) of your analyses..html
) compiled from your R Markdown document.Your R Markdown file should follow the guidance from here and your R code should follow the guidelines here. You should not include many, if any, comments in your code chunks. Your discussion should be included directly in the document.
You can work together on this but you should each turn in your own assignments and write up your work separately. Include the names of your collaborators on your assignment.
Something which will be of use to you later is the logit function. The logit function is defined as \[ \logit(x) = \log \left(\frac{x}{1-x} \right) = \log(x) - \log(1 - x) \] where \(x \in (0, 1)\). It takes a value between 0 and 1 and returns a value between \(-\infty\) and \(\infty\).
logit <- function(x) log(x) - log(1 - x)
We will revisit the sprinters data we considered in Problem Set 2.
sprinters <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/sprinters.csv")
Estimate the model \[
\begin{aligned}[t]
\mathtt{finish}_i &= \beta_0 + \beta_1 \mathtt{year}_i + \beta_2 \mathtt{women}_i + \beta_3 \mathtt{women}_i \times \mathtt{year}_i + \epsilon_i
\end{aligned}
\] Create a plot of the fitted values with confidence with respect to year
, grouped by women
. Do not use geom_smooth()
.
Create the same plot as in the previous part for the model \[ \begin{aligned}[t] \log (\mathtt{finish}_i) &= \beta_0 + \beta_1 \mathtt{year}_i + \beta_2 \mathtt{women}_i + \beta_3 \mathtt{year}_i \times \mathtt{women}_i + \epsilon_i \end{aligned} \] Be sure to explain in words how this specification differs from the one used in part a.
Now create a new variable, the ratio of men’s time to women’s time in each year. Logit-transform this variable and regress it on year. Plot the results, with confidence intervals, on the scale of the ratio men’s time to women’s time (i.e., transform it back from logit). Does this approach make any assumptions about men’s times or women’s times that might be problematic? Hint: You will need to manipulate the data frame for this. Your new data frame will have years as the unit of observation rather than (year, sex). You can do this using manually using filter
and other post-processing. Alternatively, look at the tidyr function spread
.
For this problem, we will use a cleaned-up version of the dataset of Michael Ross, “Does Oil Hinder Democracy?” World Politics, 2001. That paper estimated a time series cross-section model of Polity scores regressed on oil exports and a battery of controls. In this problem, we will focus on a single cross-section, and instead focus on model fitting.
Load and pre-process this data using the following code:
rossoil <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/rossoildata.csv") %>%
arrange(id1, year) %>%
group_by(id1) %>%
mutate(oilL5 = lag(wdr123, 5) / 100,
metalL5 = lag(wdr313, 5) / 100,
GDPpcL5 = lag(wdr135, 5) / 100,
islam = islam / 100) %>%
filter(year == 1990) %>%
select(regime1, oilL5, metalL5, GDPpcL5, islam, oecd, cty_name, id, id1) %>%
na.omit() %>%
ungroup()
The above code:
lag()
calculates the lag values of oil
, metal
, and GDPcap
. The second argument of 5
, means that it calculates a lag of 5.1group_by
ensures that the lag values are only calculated within each country.na.omit()
.ungroup()
ensures that the data is no longer grouped by id
. If you try to use summarize()
while the data is still grouped by id
, you would not get the results that you thought you would.A description of the included variables follows:
Variable | Description |
---|---|
regime1 |
1–10 scale increasing in democracy; computed from Polity components |
oilL5 |
Fuel exports as a proportion of GDP, lagged 5 years |
metalL5 |
Ore and mineral exports as a proportion of GDP, lagged 5 years |
GDPcapL5 |
per capita GDP in PPP dollars, lagged 5 years |
islam |
Muslims as a proportion of population, 1970 data |
oecd |
Dummy for rich industrialized countries |
cty_name |
The name of the country observed |
id |
A three character abbreviation of the country name |
id1 |
A numeric country code |
regime1
on oilL5
, metalL5
, GDPcapL5
, islam
, and oecd
. Record the standard error of the regression, and calculate the expected change in regime1
given a change in oilL5
from the 50th percentile to the 95th percentile of the fully observed data, all else equal.Using the residuals from the regression in part a., create the following diagnostic plots:
hccm
.What do these diagnostics tell you about the presence of heteroskedasticity, specification error, and outliers?
regime1
given a change in oilL5
from the 50th percentile to the 95th percentile of the fully observed data. See the appendix for some tips and warnings about transforming these data, though.Which model of those you have estimated do you trust most, and why? What other problems in the specification or estimation method remain unaddressed by our efforts?
Christopher Adolph
If you try to log or logit transform a covariate \(x\) with observed zeros, you will discover a problem: you can’t log a zero! A common (but wrong) “solution” is to add a small amount to the zeros (e.g., 0.1 or 0.001, etc.). It turns out that you can introduce substantial large bias in your \(\hat \beta\)s by choosing different tiny amounts to add to your 0s: logging small numbers spreads those numbers out over a huge range. Adding 0.001 before logging a variable is not very different from subtracting 10,000 from an unlogged variable! So don’t ever do this, even as a first try.
A better solution that avoids arbitrary assumptions and bias is to “dummy out” the zeros before logging. This procedure treats the zero cases as sui generis: they are uniquely different from the rest of our cases, and we estimate the way in which they are different through a separate parameter. We end up with two variables on the right-hand side: an indicator of whether \(x_i=0\), and the log (or logit) of \(x_i\) in those cases where \(x_i \ne 0\). That is, if you want to regress \(y\) on \(\textrm{log}(x)\) but \(x\) contains \(0\)s, estimate this regression: \[ y_i = \beta_0 + \beta_1 I(x_i>0) + \beta_2 \log'(x_i) + \epsilon_i \] where \(I(\cdot)\) is an indicator function and \(\log'(\cdot)\) is defined as: \[ \log'(x) = \begin{cases} 0 &\text{if $\quad x \le 0$} \\ \log(x) & \text{if $x > 0$} \end{cases} \]
If we suppose that \(x_i\) is the number of cigarettes person \(i\) smokes per day, and \(y_i\) is \(i\)’s probability of getting lung cancer, the specification makes sense: people who currently smoke even a little bit likely have a discretely higher chance of lung cancer than non-smokers, while the amount a smoker smokes may increase cancer probabilities but with diminishing marginal risk.
As you might imagine, the logic of equations 1 and 2 changes slightly if \(x\) needs to be logit transformed. Recall that the logit transformation, \[ \logit(x) = \log\left( \frac{x}{1-x} \right), \] fails if \(x\) is not between 0 and 1, so we need to dummy out \(x_i \ge 1\) and \(x_i \le 0\) separately: \[ y_i = \beta_0 + \beta_1 I(x_i>0) + \beta_2 I(x_i \ge 1) + \beta_3 \logit'(x_i) + \epsilon_i \] where \(I(\cdot)\) is an indicator function and \(\logit'(\cdot)\) is defined as: \[ \logit'(x) = \begin{cases} 0 &\mathrm{if} \quad x \le 0 \\ \log \left( \frac{x}{1-x} \right)& \mathrm{if} \quad 0<x<1 \\ 0 &\mathrm{if} \quad x \ge 1 \end{cases} \] Note that we will only need all three pieces of Equation 4 if the covariate to be logit transformed contains both 0s and 1s; should either extreme be missing, we need only add one dummy variable to the specification.
The logBound()
and logitBound()
functions are included in hw4-functions.R
. You can load these functions with the following code (provided you downloaded hw4-functions.R
and put it in your package directory):
source("http://UW-POLS503.github.io/pols_503_sp15/hw/hw4-functions.R")
After loading these functions, the above regression is as simple as:
res <- lm(y ~ logBound(x), data)
You can compare goodness of fit as usual. Moreover, you can use this technique on the right-hand side of any regression-like model, not just least squares regression.
Take special care in interpreting models in models with logBound(x)
or logitBound(x)
in the model formula. In setting up a hypothetical scenario for post-estimation prediction, make sure both the dummy term and the log term are set consistent with each other. For example, if the dummy is set to 0, the log must also be zero. And if the log is set to something other than 0, the dummy must be set to 1. Otherwise, you are asking the model to predict a logically impossible scenario; e.g., asking what happens when someone both smokes zero cigarettes and smokes twenty cigarettes in the same day.
I recommend either calculating the predicted values of regime1
“by hand”, or using the simcf package, as illustrated below. Our old friend predict()
is very unlikely to return results for models including these terms, though if it does return an answer it will agree with other methods.
This section provides some example code that will make it easier to answer the homework using simulation from simcf.
If you haven’t already installed simcf, you will need to install it. You will need to have devtools installed.
library("devtools")
install_github("chrisadolph/tile-simcf", subdir = "tile")
install_github("chrisadolph/tile-simcf", subdir = "simcf")
The following function, predOil
applies simulation to the particular problem in this problem set. The arguments of predOil
are:
The object returned by lm()
.
Formula used in the initial call to lm()
that created res
.
The data used in the initial call to lm()
that created res
.
The number of simulations to run.
A number between 0 and 1. The confidence level to use in confidence interval.
This will return a data.frame
object with columns
pe
: Point estimatelower
: Lower bound of the confidence intervalupper
: Upper bound of the confidence intervallibrary("simcf")
predOil <- function(res, formula, data, sims = 1000, ci = 0.95) {
pe <- res$coefficients
vc <- vcov(res)
simbetas <- mvrnorm(sims, pe, vc)
xscen <- cfMake(formula, data = data, nscen = 1)
xscen <- cfChange(xscen, "oilL5",
x = quantile(data$oilL5, probs = 0.95),
xpre = quantile(data$oilL5, probs = 0.5),
scen = 1)
as.data.frame(linearsimfd(xscen, simbetas, ci = ci))
}
An example call to this function.
mod0_formula <- regime1 ~ oilL5 + metalL5 + GDPpcL5 + logitBound(islam) + oecd
mod0 <- lm(mod0_formula, data = rossoil)
pred0_oil <- predOil(mod0, mod0_formula, rossoil)
It may be useful to combine the estimates from several runs using the function bind_rows
from dplyr.
Derived from of Christopher Adolph, “Problem Set 4”, POLS/CSSS 503, University of Washington, Spring 2014. http://faculty.washington.edu/cadolph/503/503hw4.pdf. Used with permission.
The original ross95
data has separate variables with the lagged values, but I wanted to show how to calculate lags with dplyr
.↩