Objective: The space shuttle Challenger exploded on January 28, 1986. It was a tragedy, and one that serves as a lesson in proper data analysis. The space shuttle exploded due to an O-ring failure in the rocket caused by low temperatures. There had been a history of O-ring damages to space shuttles prior to the launch. The temperature at the time of the launch was forecast to be 31 degrees Fahrenheit. You want to analyze the relationship between temperature and O-ring damage using the data available on damages to O-rings prior to the launch.
The first chunk should load any packages we will use. If these are not installed, we may need to install them using install.packages()
.
suppressPackageStartupMessages({
library("ggplot2")
library("dplyr")
library("alr4")
})
The package ggplot2
will be used to produce plots. The package dplyr
is used for data manipulation and cleaning. The package alr4
contains a dataset Challeng
with data on t
The second chunk should load any data we will use in the analysis. This is also by convention to make it easier for the reader to understand what you will be using in the analysis.
We will
data("Challeng")
What’s in the data?
summary(Challeng)
## temp pres fail n
## Min. :53.00 Min. : 50.0 Min. :0.0000 Min. :6
## 1st Qu.:67.00 1st Qu.: 50.0 1st Qu.:0.0000 1st Qu.:6
## Median :70.00 Median :200.0 Median :0.0000 Median :6
## Mean :69.57 Mean :145.7 Mean :0.3913 Mean :6
## 3rd Qu.:75.00 3rd Qu.:200.0 3rd Qu.:1.0000 3rd Qu.:6
## Max. :81.00 Max. :200.0 Max. :2.0000 Max. :6
## erosion blowby damage
## Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median :0.0000 Median : 0.000
## Mean :0.3478 Mean :0.1739 Mean : 1.435
## 3rd Qu.:0.5000 3rd Qu.:0.0000 3rd Qu.: 3.000
## Max. :3.0000 Max. :2.0000 Max. :11.000
The two variables we are most interested in are temp
and a measure of damage. There are several variables that measure damage to the rockets so we will choose damage
, a damage index. The damage index runs from 0 to 12.
The first step in the data analysis is to plot the variables. Save the plot result to a variable to reuse it in a later chunk.
challenger_plot_1 <- ggplot(Challeng, aes(x = temp, y = damage)) +
geom_point()
But print it so it displays
challenger_plot_1
There appears to be a negative relationship between temperature and damage (lower temperature, more damage).
Let’s run a regression of damage
on temp
. This is not the best model for this, but it is a good first cut.
challenger_model1 <- lm(damage ~ temp, data = Challeng)
summary(challenger_model1)
##
## Call:
## lm(formula = damage ~ temp, data = Challeng)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3025 -1.4507 -0.4928 0.7397 5.5337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.36508 4.43859 4.138 0.000468 ***
## temp -0.24337 0.06349 -3.833 0.000968 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.102 on 21 degrees of freedom
## Multiple R-squared: 0.4116, Adjusted R-squared: 0.3836
## F-statistic: 14.69 on 1 and 21 DF, p-value: 0.0009677
However, the statistical significance of this relationship is not particularly useful in this case. What we want to know is what is the predicted damage for a temperature of 31 degrees.
predict(challenger_model1, newdata = data.frame(temp = 31))
## 1
## 10.82052
At 31 degrees we would observe as much damage as had ever been observed before.
Let’s plot the fit of the linear regression.
challenger_plot_1 + geom_smooth(method = "lm")
This plot displays reasons why the linear model is a poor one; the predictions extend below 0 and the data don’t appear linear (exponential?).
Instead, let’s create a new variable called anydamage
that is 1 if there is any damage and 0 if there was no damage.
Challeng <- mutate(Challeng, anydamage = as.integer(damage > 0))
Now there is an additional column (anydamage
).
summary(Challeng)
## temp pres fail n
## Min. :53.00 Min. : 50.0 Min. :0.0000 Min. :6
## 1st Qu.:67.00 1st Qu.: 50.0 1st Qu.:0.0000 1st Qu.:6
## Median :70.00 Median :200.0 Median :0.0000 Median :6
## Mean :69.57 Mean :145.7 Mean :0.3913 Mean :6
## 3rd Qu.:75.00 3rd Qu.:200.0 3rd Qu.:1.0000 3rd Qu.:6
## Max. :81.00 Max. :200.0 Max. :2.0000 Max. :6
## erosion blowby damage anydamage
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median : 0.000 Median :0.0000
## Mean :0.3478 Mean :0.1739 Mean : 1.435 Mean :0.3043
## 3rd Qu.:0.5000 3rd Qu.:0.0000 3rd Qu.: 3.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000 Max. :11.000 Max. :1.0000
As before, the first thing we do is plot the data.
ggplot(Challeng, aes(x = temp, y = anydamage)) +
geom_point()
Instead of running a linear model, let’s run a logit model, which is a type of generalized linear model for binary data. This is run using the function glm
.
challenger_model_2 <- glm(anydamage ~ temp, family = "binomial",
data = Challeng)
summary(challenger_model_2)
##
## Call:
## glm(formula = anydamage ~ temp, family = "binomial", data = Challeng)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## temp -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
The coefficients are even less useful than the linear case.
But we can calculate the probability of any damage at a temperature of 31 degrees.
predict(challenger_model_2, newdata = data.frame(temp = 31), type = "response")
## 1
## 0.9996088
The probability is over 99%!
We may be interested about how much uncertainty there was about this estimate. So we plot the fit of the model.
ggplot(Challeng, aes(x = temp, y = anydamage)) +
geom_point() +
geom_smooth(method = "glm", family = "binomial")
We analyzed the data in several ways. None of them suggest that launching at 31 degrees was a reasonable idea.
DO NOT LAUNCH!