Objective: The space shuttle Challenger exploded on January 28, 1986. It was a tragedy, and one that serves as a lesson in proper data analysis. The space shuttle exploded due to an O-ring failure in the rocket caused by low temperatures. There had been a history of O-ring damages to space shuttles prior to the launch. The temperature at the time of the launch was forecast to be 31 degrees Fahrenheit. You want to analyze the relationship between temperature and O-ring damage using the data available on damages to O-rings prior to the launch.

The first chunk should load any packages we will use. If these are not installed, we may need to install them using install.packages().

suppressPackageStartupMessages({
  library("ggplot2")
  library("dplyr")
  library("alr4")
})

The package ggplot2 will be used to produce plots. The package dplyr is used for data manipulation and cleaning. The package alr4 contains a dataset Challeng with data on t

The second chunk should load any data we will use in the analysis. This is also by convention to make it easier for the reader to understand what you will be using in the analysis.

We will

data("Challeng")

What’s in the data?

summary(Challeng)
##       temp            pres            fail              n    
##  Min.   :53.00   Min.   : 50.0   Min.   :0.0000   Min.   :6  
##  1st Qu.:67.00   1st Qu.: 50.0   1st Qu.:0.0000   1st Qu.:6  
##  Median :70.00   Median :200.0   Median :0.0000   Median :6  
##  Mean   :69.57   Mean   :145.7   Mean   :0.3913   Mean   :6  
##  3rd Qu.:75.00   3rd Qu.:200.0   3rd Qu.:1.0000   3rd Qu.:6  
##  Max.   :81.00   Max.   :200.0   Max.   :2.0000   Max.   :6  
##     erosion           blowby           damage      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000  
##  Median :0.0000   Median :0.0000   Median : 0.000  
##  Mean   :0.3478   Mean   :0.1739   Mean   : 1.435  
##  3rd Qu.:0.5000   3rd Qu.:0.0000   3rd Qu.: 3.000  
##  Max.   :3.0000   Max.   :2.0000   Max.   :11.000

The two variables we are most interested in are temp and a measure of damage. There are several variables that measure damage to the rockets so we will choose damage, a damage index. The damage index runs from 0 to 12.

The first step in the data analysis is to plot the variables. Save the plot result to a variable to reuse it in a later chunk.

challenger_plot_1 <- ggplot(Challeng, aes(x = temp, y = damage)) + 
  geom_point()

But print it so it displays

challenger_plot_1

There appears to be a negative relationship between temperature and damage (lower temperature, more damage).

Let’s run a regression of damage on temp. This is not the best model for this, but it is a good first cut.

challenger_model1 <- lm(damage ~ temp, data = Challeng)
summary(challenger_model1)
## 
## Call:
## lm(formula = damage ~ temp, data = Challeng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3025 -1.4507 -0.4928  0.7397  5.5337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.36508    4.43859   4.138 0.000468 ***
## temp        -0.24337    0.06349  -3.833 0.000968 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.102 on 21 degrees of freedom
## Multiple R-squared:  0.4116, Adjusted R-squared:  0.3836 
## F-statistic: 14.69 on 1 and 21 DF,  p-value: 0.0009677

However, the statistical significance of this relationship is not particularly useful in this case. What we want to know is what is the predicted damage for a temperature of 31 degrees.

predict(challenger_model1, newdata = data.frame(temp = 31))
##        1 
## 10.82052

At 31 degrees we would observe as much damage as had ever been observed before.

Let’s plot the fit of the linear regression.

challenger_plot_1 + geom_smooth(method = "lm")

This plot displays reasons why the linear model is a poor one; the predictions extend below 0 and the data don’t appear linear (exponential?).

Instead, let’s create a new variable called anydamage that is 1 if there is any damage and 0 if there was no damage.

Challeng <- mutate(Challeng, anydamage = as.integer(damage > 0))

Now there is an additional column (anydamage).

summary(Challeng)
##       temp            pres            fail              n    
##  Min.   :53.00   Min.   : 50.0   Min.   :0.0000   Min.   :6  
##  1st Qu.:67.00   1st Qu.: 50.0   1st Qu.:0.0000   1st Qu.:6  
##  Median :70.00   Median :200.0   Median :0.0000   Median :6  
##  Mean   :69.57   Mean   :145.7   Mean   :0.3913   Mean   :6  
##  3rd Qu.:75.00   3rd Qu.:200.0   3rd Qu.:1.0000   3rd Qu.:6  
##  Max.   :81.00   Max.   :200.0   Max.   :2.0000   Max.   :6  
##     erosion           blowby           damage         anydamage     
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median : 0.000   Median :0.0000  
##  Mean   :0.3478   Mean   :0.1739   Mean   : 1.435   Mean   :0.3043  
##  3rd Qu.:0.5000   3rd Qu.:0.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :2.0000   Max.   :11.000   Max.   :1.0000

As before, the first thing we do is plot the data.

ggplot(Challeng, aes(x = temp, y = anydamage)) +
  geom_point() 

Instead of running a linear model, let’s run a logit model, which is a type of generalized linear model for binary data. This is run using the function glm.

challenger_model_2 <- glm(anydamage ~ temp, family = "binomial", 
                          data = Challeng)
summary(challenger_model_2)
## 
## Call:
## glm(formula = anydamage ~ temp, family = "binomial", data = Challeng)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.7613  -0.3783   0.4524   2.2175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## temp         -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5

The coefficients are even less useful than the linear case.

But we can calculate the probability of any damage at a temperature of 31 degrees.

predict(challenger_model_2, newdata = data.frame(temp = 31), type = "response")
##         1 
## 0.9996088

The probability is over 99%!

We may be interested about how much uncertainty there was about this estimate. So we plot the fit of the model.

ggplot(Challeng, aes(x = temp, y = anydamage)) +
  geom_point() +
  geom_smooth(method = "glm", family = "binomial")

We analyzed the data in several ways. None of them suggest that launching at 31 degrees was a reasonable idea.

DO NOT LAUNCH!