POLS/CSSS 503, University of Washington, Spring 2015
This problem uses sprinters.csv which contains the winning times from the meter sprint in Olympic competitions going back to 1900.1
The dataset sprinters
contains the following variables:
Variable | Description |
---|---|
finish |
best time in seconds in the meter sprint |
year |
the year of the competition |
women |
1 if the time is women’s best; 0 if the time the men’s best. |
In R, Create a matrix \(X\) comprised of three columns: a column of ones, a column made of the variable year, and a column made up of the variable women. Create a matrix \(y\) comprised of a single column, made up of the variable finish. Now compute the following using R’s matrix commands (note that you will need to use the matrix multiplication operator %*%
): \[
b = (X' X)^{-1} X' y
\] Report the result of this calculation.
See Matrices in R for more information on how to use matrices in R.
Using the function lm
, run a regression of finish
on year
and women
. Compare the results the calculation you did in part a.
Make a nice plot summarizing this regression. On a single graph, plot the data and the regression line. Make sure the graph is labeled nicely, so that anyone who does not know your variable names could still read it.
Rerun the regression, adding an interaction between women
and year
.
Redo the plot with a new fit, one for each level of women
.
Suppose that an Olympics had been held in 2001. Use the predict
function to calculate the expected finishing time for men and for women. Calculate 95% confidence intervals for the predictions.
The authors of the Nature article were interested in predicting the finishing times for the 2156 Olympics. Use predict
to do so, for both men and women, and report 95% confidence intervals for your results.
Do you trust the model’s predictions? Is there reason to trust the 2001 prediction more than the 2156 prediction? Is any assumption of the model being abused or overworked to make this prediction? Hint: Try predicting the finishing times in the year 3000 C.E.
This question will use a dataset included with R.
data("anscombe")
The dataset consists of 4 seperate datasets each with an \(x\) and \(y\) variable. The original dataset is not a tidy dataset. The following code creates a tidy dataset of the anscombe data that is easier to analyze than the
library("dplyr")
library("tidyr")
anscombe2 <- anscombe %>%
mutate(obs = row_number()) %>%
gather(variable_dataset, value, - obs) %>%
separate(variable_dataset, c("variable", "dataset"), sep = 1L) %>%
spread(variable, value) %>%
arrange(dataset, obs)
facet_wrap
.This problem relates to your own research and final paper.
If you do not have data at this point, talk to me ASAP. It’s okay if you end up doing something different for your final paper, or are still unsure of you analysis. The point of this is to get you working with your data as soon as possible, so any problems arise early and can be dealt with now, when things can be done, and not later, when it is too late.
Derived from of Christopher Adolph, “Problem Set 2”, POLS/CSSS 503, University of Washington, Spring 2014. http://faculty.washington.edu/cadolph/503/503hw2.pdf. Used with permission.
A. J. Tatem, C. A. Guerra, P. M. Atkinson and S. I. Hay, Nature Vol. 431, p. 525 (2004).↩