POLS/CSSS 503, University of Washington, Spring 2015

\[ \DeclareMathOperator{\logit}{logit} \]

Instructions and Introduction

  1. Create a new R project for this homework named hw4 and load that project.
  2. Download hw4-functions.R and save it to your project directory.
  3. Submit a zipped file of the directory with your R project through Canvas. It should contain all the materials another person would need to run your R Markdown file:

    • The R project (.Rproj) file.
    • The R Markdown document (.Rmd) of your analyses.
    • An HTML document (.html) compiled from your R Markdown document.
    • Any data or other files needed to run the analyses in your R Markdown document.
  4. Your R Markdown file should follow the guidance from here and your R code should follow the guidelines here. You should not include many, if any, comments in your code chunks. Your discussion should be included directly in the document.

  5. Turn in a paper copy of the document compiled from the analyses at lab.
  6. You can work together on this, but you should each turn in your own assignment and write up your work separately. Include the names of your collaborators on your assignment.

Something that will be of use to you later is the logit function, defined as \[ \logit(x) = \log \left(\frac{x}{1-x} \right) = \log(x) - \log(1 - x) \] where \(x \in (0, 1)\). It takes a value between 0 and 1 and returns a value between \(-\infty\) and \(\infty\).

logit <- function(x) log(x) - log(1 - x)
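
You will also need to undo this transformation later (for example, in part 6 of the first problem). A minimal sketch of the inverse, with invlogit as a name of my choosing:

invlogit <- function(x) 1 / (1 + exp(-x))  # maps (-Inf, Inf) back to (0, 1)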

Problems

Showing Confidence

We will revisit the sprinters data we considered in Problem Set 2.

sprinters <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/sprinters.csv")
  1. Estimate the model \[ \begin{aligned}[t] \mathtt{finish}_i &= \beta_0 + \beta_1 \mathtt{year}_i + \beta_2 \mathtt{women}_i + \beta_3 \mathtt{women}_i \times \mathtt{year}_i + \epsilon_i \end{aligned} \] Create a plot of the fitted values, with confidence intervals, against year, grouped by women. Do not use geom_smooth() (see the sketch after this problem list).

  2. Create the same plot as in the previous part for the model \[ \begin{aligned}[t] \log (\mathtt{finish}_i) &= \beta_0 + \beta_1 \mathtt{year}_i + \beta_2 \mathtt{women}_i + \beta_3 \mathtt{year}_i \times \mathtt{women}_i + \epsilon_i \end{aligned} \] Be sure to explain in words how this specification differs from the one used in part 1.

  3. Rerun the analysis and recreate the plot, adding confidence intervals, for the model: \[ \begin{aligned}[t] \mathtt{finish}_i &= \beta_0 + \beta_1 \mathtt{year}_i + \beta_2 \mathtt{women}_i + \beta_3 \mathtt{year}_i \times \mathtt{women}_i \\ & + \beta_4 \mathtt{year}_i^2 + \beta_5 \mathtt{year}_i^2 \times \mathtt{women}_i + \epsilon_i \end{aligned} \] Be sure to explain in words how this specification differs from the ones used in parts 1 and 2.
  4. Compare the visual fit of these models to the data within the observed period. Which of them do you find to be plausible fits?
  5. Do these models have different predictions for the Olympics of 2156? (Hint: extending your plots to go up to 2156 is an easy way to see this.) Why or why not?
  6. Now create a new variable: the ratio of men’s time to women’s time in each year. Logit-transform this variable and regress it on year. Plot the results, with confidence intervals, on the scale of the ratio of men’s time to women’s time (i.e., transform it back from the logit scale). Does this approach make any assumptions about men’s times or women’s times that might be problematic? Hint: You will need to manipulate the data frame so that its unit of observation is the year rather than the (year, sex) pair. You can do this manually using filter and other post-processing. Alternatively, look at the tidyr function spread (see the second sketch below).
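
For parts 1 through 3, you can construct fitted values and confidence intervals without geom_smooth() by combining predict() with ggplot2. A minimal sketch for the model in part 1, assuming the sprinters data have columns finish, year, and women, with mod1 and pred_df as names of my choosing:

library("ggplot2")

mod1 <- lm(finish ~ year * women, data = sprinters)
# predict() with interval = "confidence" adds columns fit, lwr, and upr
pred_df <- cbind(sprinters, predict(mod1, interval = "confidence"))
ggplot(pred_df, aes(x = year, colour = factor(women))) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, fill = factor(women)),
              alpha = 0.3, colour = NA) +
  geom_line(aes(y = fit)) +
  geom_point(aes(y = finish))

For part 5, pass a newdata argument to predict() with years extending to 2156. For part 6, a sketch of the reshaping step with tidyr and dplyr, again assuming women is a 0/1 indicator:

library("dplyr")
library("tidyr")

ratios <- sprinters %>%
  mutate(sex = ifelse(women == 1, "women", "men")) %>%
  select(year, sex, finish) %>%
  spread(sex, finish) %>%          # one row per year, columns men and women
  mutate(ratio = men / women)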

Model Selection: Oil & Democracy

For this problem, we will use a cleaned-up version of the dataset from Michael Ross, “Does Oil Hinder Democracy?” World Politics, 2001. That paper estimated a time-series cross-section model of Polity scores regressed on oil exports and a battery of controls. Here we will restrict attention to a single cross-section and concentrate on model fitting.

Load and pre-process this data using the following code:

library("dplyr")

rossoil <- read.csv("http://UW-POLS503.github.io/pols_503_sp15/data/rossoildata.csv") %>%
   arrange(id1, year) %>%
   group_by(id1) %>%
   mutate(oilL5 = lag(wdr123, 5) / 100,
          metalL5 = lag(wdr313, 5) / 100,
          GDPpcL5 = lag(wdr135, 5) / 100,
          islam = islam / 100) %>%
   filter(year == 1990) %>%
   select(regime1, oilL5, metalL5, GDPpcL5, islam, oecd, cty_name, id, id1) %>%
   na.omit() %>%
   ungroup()

The above code:

  • lag() calculates the lagged values of oil, metal, and GDP per capita. The second argument, 5, means that it calculates a lag of 5 years.¹
  • group_by() ensures that the lags are calculated only within each country.
  • filter() keeps only the 1990 observations, and select() keeps only a subset of variables.
  • na.omit() omits observations with missing values.
  • ungroup() ensures that the data are no longer grouped by id1. If you were to use summarize() while the data were still grouped by id1, you would not get the results you expected (see the illustration below).
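
As a hypothetical illustration of that last point (mean_oil is a name of my choosing):

summarize(rossoil, mean_oil = mean(oilL5))                 # one overall mean: data are ungrouped
summarize(group_by(rossoil, id1), mean_oil = mean(oilL5))  # one row per id1 instead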

A description of the included variables follows:

Variable   Description
---------  ------------------------------------------------------------------
regime1    1–10 scale increasing in democracy; computed from Polity components
oilL5      Fuel exports as a proportion of GDP, lagged 5 years
metalL5    Ore and mineral exports as a proportion of GDP, lagged 5 years
GDPpcL5    Per capita GDP in PPP dollars, lagged 5 years
islam      Muslims as a proportion of the population, 1970 data
oecd       Dummy for rich industrialized countries
cty_name   The name of the country observed
id         A three-character abbreviation of the country name
id1        A numeric country code
  1. Estimate a linear regression of regime1 on oilL5, metalL5, GDPpcL5, islam, and oecd. Record the standard error of the regression, and calculate the expected change in regime1 given a change in oilL5 from the 50th percentile to the 95th percentile of the fully observed data, all else equal.
  2. Using the residuals from the regression in part 1, create the following diagnostic plots (a code sketch follows this problem list):

    1. Plot the residuals against the fitted values
    2. Plot the residuals against each covariate
    3. Plot the studentized residuals against the standardized hat values.
    4. Calculate the heteroskedasticity-consistent standard errors and compare them to the classical standard errors. Use the car function hccm().

    What do these diagnostics tell you about the presence of heteroskedasticity, specification error, and outliers?

  3. Rerun the regression using either log or logit transformations on any covariates you see fit. You will likely run several specifications. In each run, record the standard error of the regression, and the expected change in regime1 given a change in oilL5 from the 50th percentile to the 95th percentile of the fully observed data. See the appendix for some tips and warnings about transforming these data, though.
  4. How much substantive difference does finding the best model make? Be specific and concrete; i.e., show what each model does. I’m asking for a more detailed answer than you usually see in articles. How much substantive doubt is there in the result if we are not sure which of the models you fit is the “right” one?
  5. Which model of those you have estimated do you trust most, and why? What other problems in the specification or estimation method remain unaddressed by our efforts?
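
A minimal sketch of the diagnostics in part 2, assuming the model from part 1 is stored in an object I call mod_a (I plot raw hat values; standardize them as the problem directs):

library("car")  # for hccm()

mod_a <- lm(regime1 ~ oilL5 + metalL5 + GDPpcL5 + islam + oecd, data = rossoil)

plot(fitted(mod_a), residuals(mod_a))    # residuals vs. fitted values
plot(rossoil$oilL5, residuals(mod_a))    # residuals vs. one covariate; repeat for the others
plot(hatvalues(mod_a), rstudent(mod_a))  # studentized residuals vs. hat values

sqrt(diag(hccm(mod_a)))  # heteroskedasticity-consistent standard errors
sqrt(diag(vcov(mod_a)))  # classical standard errors, for comparison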

Appendix: How Do I Log a Covariate with Zeros?

Christopher Adolph

If you try to log or logit transform a covariate \(x\) with observed zeros, you will discover a problem: you can’t log a zero! A common (but wrong) “solution” is to add a small amount to the zeros (e.g., 0.1 or 0.001). It turns out that you can introduce substantial bias in your \(\hat \beta\)s by choosing different tiny amounts to add to your 0s: logging small numbers spreads those numbers out over a huge range. Adding 0.001 before logging a variable is not very different from subtracting 10,000 from an unlogged variable! So don’t ever do this, even as a first try.

A Solution: the logBound and logitBound Transformations

A better solution that avoids arbitrary assumptions and bias is to “dummy out” the zeros before logging. This procedure treats the zero cases as sui generis: they are uniquely different from the rest of our cases, and we estimate the way in which they are different through a separate parameter. We end up with two variables on the right-hand side: an indicator of whether \(x_i = 0\), and the log (or logit) of \(x_i\) in those cases where \(x_i \ne 0\). That is, if you want to regress \(y\) on \(\log(x)\) but \(x\) contains \(0\)s, estimate this regression: \[ y_i = \beta_0 + \beta_1 I(x_i>0) + \beta_2 \log'(x_i) + \epsilon_i \] where \(I(\cdot)\) is an indicator function and \(\log'(\cdot)\) is defined as: \[ \log'(x) = \begin{cases} 0 &\text{if } x \le 0 \\ \log(x) & \text{if } x > 0 \end{cases} \]

If we suppose that \(x_i\) is the number of cigarettes person \(i\) smokes per day, and \(y_i\) is \(i\)’s probability of getting lung cancer, the specification makes sense: people who currently smoke even a little bit likely have a discretely higher chance of lung cancer than non-smokers, while the amount a smoker smokes may increase cancer probabilities but with diminishing marginal risk.

As you might imagine, the logic above changes slightly if \(x\) needs to be logit transformed. Recall that the logit transformation, \[ \logit(x) = \log\left( \frac{x}{1-x} \right), \] fails if \(x\) is not between 0 and 1, so we need to dummy out \(x_i \ge 1\) and \(x_i \le 0\) separately: \[ y_i = \beta_0 + \beta_1 I(x_i>0) + \beta_2 I(x_i \ge 1) + \beta_3 \logit'(x_i) + \epsilon_i \] where \(I(\cdot)\) is an indicator function and \(\logit'(\cdot)\) is defined as: \[ \logit'(x) = \begin{cases} 0 &\text{if } x \le 0 \\ \log \left( \frac{x}{1-x} \right) & \text{if } 0 < x < 1 \\ 0 &\text{if } x \ge 1 \end{cases} \] Note that we will only need all three pieces of the \(\logit'\) specification if the covariate to be logit transformed contains both 0s and 1s; should either extreme be missing, we need only add one dummy variable to the specification.
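
To make this concrete, here is a minimal sketch of what such transformation functions might look like; the official versions are provided in hw4-functions.R (introduced below) and may differ in details:

logBound <- function(x) {
  pos <- x > 0
  logx <- rep(0, length(x))
  logx[pos] <- log(x[pos])
  # Two columns: the I(x > 0) dummy and log'(x)
  cbind(nonzero = as.numeric(pos), log = logx)
}

logitBound <- function(x) {
  mid <- x > 0 & x < 1
  lgt <- rep(0, length(x))
  lgt[mid] <- log(x[mid] / (1 - x[mid]))
  # Three columns: the I(x > 0) and I(x >= 1) dummies and logit'(x)
  cbind(nonzero = as.numeric(x > 0),
        one = as.numeric(x >= 1),
        logit = lgt)
}

Because each function returns a matrix, lm() expands its columns into separate regressors, which is what makes the one-term formula below work.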

The logBound() and logitBound() functions are included in hw4-functions.R. If you saved hw4-functions.R to your project directory, load it with source("hw4-functions.R"); you can also source it directly from the web:

source("http://UW-POLS503.github.io/pols_503_sp15/hw/hw4-functions.R")

After loading these functions, the above regression is as simple as:

res <- lm(y ~ logBound(x), data)

You can compare goodness of fit as usual. Moreover, you can use this technique on the right-hand side of any regression-like model, not just least squares regression.

Interpretation of results

Take special care when interpreting models with logBound(x) or logitBound(x) in the model formula. In setting up a hypothetical scenario for post-estimation prediction, make sure the dummy term and the log term are set consistently with each other. For example, if the dummy is set to 0, the log term must also be 0; and if the log term is set to something other than 0, the dummy must be set to 1. Otherwise, you are asking the model to predict a logically impossible scenario; e.g., asking what happens when someone both smokes zero cigarettes and smokes twenty cigarettes in the same day.

I recommend either calculating the predicted values of regime1 “by hand”, or using the simcf package, as illustrated below. Our old friend predict() is very unlikely to return results for models including these terms, though if it does return an answer it will agree with other methods.
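
As an illustration, a “by hand” prediction for a hypothetical model fit as res <- lm(y ~ logBound(x), data) might look like this (the coefficient order is assumed to be intercept, dummy, log term; check names(coef(res)) for your fit):

b <- coef(res)
x_new <- 20  # a scenario with x > 0, so the dummy is 1 and the log term is log(20)
yhat <- b[1] + b[2] * 1 + b[3] * log(x_new)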

Appendix: Example Code

This section provides some example code that will make it easier to answer the homework using simulation from simcf.

If you haven’t already installed simcf, install it with devtools (which itself must be installed):

library("devtools")
install_github("chrisadolph/tile-simcf", subdir = "tile")
install_github("chrisadolph/tile-simcf", subdir = "simcf")  

The following function, predOil, applies simulation to the particular problem in this problem set. Its arguments are:

res: The object returned by lm().

formula: The formula used in the call to lm() that created res.

data: The data used in the call to lm() that created res.

sims: The number of simulations to run.

ci: A number between 0 and 1; the confidence level to use for the confidence interval.

It returns a data.frame whose columns give the point estimate of the expected change and the lower and upper bounds of its confidence interval.

library("simcf")
predOil <- function(res, formula, data, sims = 1000, ci = 0.95) {
  pe <- res$coefficients
  vc <- vcov(res)
  simbetas <- mvrnorm(sims, pe, vc)
  xscen <- cfMake(formula, data = data, nscen = 1)
  xscen <- cfChange(xscen, "oilL5",
                    x = quantile(data$oilL5, probs = 0.95),
                    xpre = quantile(data$oilL5, probs = 0.5),
                    scen = 1)
  as.data.frame(linearsimfd(xscen, simbetas, ci = ci))
}

An example call to this function:

mod0_formula <- regime1 ~ oilL5 + metalL5 + GDPpcL5 + logitBound(islam) + oecd
mod0 <- lm(mod0_formula, data = rossoil)
pred0_oil <- predOil(mod0, mod0_formula, rossoil)

It may be useful to combine the estimates from several runs using the function bind_rows from dplyr.
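
For example, assuming a second set of predictions pred1_oil from another specification (a name of my choosing):

all_pred <- bind_rows(
  mutate(pred0_oil, model = "mod0"),
  mutate(pred1_oil, model = "mod1")
)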


Derived from Christopher Adolph, “Problem Set 4”, POLS/CSSS 503, University of Washington, Spring 2014. http://faculty.washington.edu/cadolph/503/503hw4.pdf. Used with permission.


  1. The original ross95 data has separate variables with the lagged values, but I wanted to show how to calculate lags with dplyr.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. R code is licensed under a BSD 2-clause license.