POLS/CSSS 503, University of Washington, Spring 2015

Problem 1

This problem uses sprinters.csv which contains the winning times from the meter sprint in Olympic competitions going back to 1900.1

The dataset sprinters contains the following variables:

Variable Description
finish best time in seconds in the meter sprint
year the year of the competition
women 1 if the time is women’s best; 0 if the time the men’s best.
  1. In R, Create a matrix \(X\) comprised of three columns: a column of ones, a column made of the variable year, and a column made up of the variable women. Create a matrix \(y\) comprised of a single column, made up of the variable finish. Now compute the following using R’s matrix commands (note that you will need to use the matrix multiplication operator %*%): \[ b = (X' X)^{-1} X' y \] Report the result of this calculation.

    See Matrices in R for more information on how to use matrices in R.

  2. Using the function lm, run a regression of finish on year and women. Compare the results the calculation you did in part a.

  3. Make a nice plot summarizing this regression. On a single graph, plot the data and the regression line. Make sure the graph is labeled nicely, so that anyone who does not know your variable names could still read it.

  4. Rerun the regression, adding an interaction between women and year.

  5. Redo the plot with a new fit, one for each level of women.

  6. Suppose that an Olympics had been held in 2001. Use the predict function to calculate the expected finishing time for men and for women. Calculate 95% confidence intervals for the predictions.

  7. The authors of the Nature article were interested in predicting the finishing times for the 2156 Olympics. Use predict to do so, for both men and women, and report 95% confidence intervals for your results.

  8. Do you trust the model’s predictions? Is there reason to trust the 2001 prediction more than the 2156 prediction? Is any assumption of the model being abused or overworked to make this prediction? Hint: Try predicting the finishing times in the year 3000 C.E.

Problem 2

This question will use a dataset included with R.

data("anscombe")

The dataset consists of 4 seperate datasets each with an \(x\) and \(y\) variable. The original dataset is not a tidy dataset. The following code creates a tidy dataset of the anscombe data that is easier to analyze than the

library("dplyr")
library("tidyr")
anscombe2 <- anscombe %>%
    mutate(obs = row_number()) %>%
    gather(variable_dataset, value, - obs) %>%
    separate(variable_dataset, c("variable", "dataset"), sep = 1L) %>%
    spread(variable, value) %>%
    arrange(dataset, obs)
  1. For each dataset: calculate the mean and standard deviations of x and y, and correlation between x and y, and run a linear regression between x and y for each dataset. How similar do you think that these datasets will look?
  2. Create a scatter plot of each dataset and its linear regression fit. Hint: you can do this easily with facet_wrap.

Problem 3

This problem relates to your own research and final paper.

  1. Describe your data. Do you have it in a form that you can load it into R? What variables does it include? What are their descriptions and types?
  2. Describe, in as precise terms as possible, the distribution of the outcome varaible you plan to use. If you have the data in hand, a histogram would be ideal; if you do not, give a verbal description of what you expect the distribution to look like. Be sure to indicate if the data are continuous or categorical.Describe in
  3. What challenges would your data pose for analysis by least squares regression? Be sure to discuss any potential violations of the assumptions of the GaussMarkov theorem, as well as any other complications or difficulties you see in modeling your data.

If you do not have data at this point, talk to me ASAP. It’s okay if you end up doing something different for your final paper, or are still unsure of you analysis. The point of this is to get you working with your data as soon as possible, so any problems arise early and can be dealt with now, when things can be done, and not later, when it is too late.


Derived from of Christopher Adolph, “Problem Set 2”, POLS/CSSS 503, University of Washington, Spring 2014. http://faculty.washington.edu/cadolph/503/503hw2.pdf. Used with permission.


  1. A. J. Tatem, C. A. Guerra, P. M. Atkinson and S. I. Hay, Nature Vol. 431, p. 525 (2004).


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. R code is licensed under a BSD 2-clause license.