Chapter 9 Classical Statistical Hypothesis Testing

This chapter relates to Lecture 10, and contains some (re)worked examples from earlier sessions, focusing specifically on their hypothesis testing aspects.

9.1 Correlation Significance Tests

Here, we revisit the correlation between GDP per capita and the Happiness metric, using data I pulled from Our World in Data:

##    Country            Happiness         GDPpc             Pop           
##  Length:249         Min.   :2.400   Min.   :   731   Min.   :8.090e+02  
##  Class :character   1st Qu.:4.670   1st Qu.:  4917   1st Qu.:4.153e+05  
##  Mode  :character   Median :5.530   Median : 12655   Median :5.596e+06  
##                     Mean   :5.492   Mean   : 20464   Mean   :5.918e+07  
##                     3rd Qu.:6.260   3rd Qu.: 30100   3rd Qu.:2.421e+07  
##                     Max.   :7.820   Max.   :112557   Max.   :4.663e+09  
##                     NA's   :96      NA's   :52       NA's   :7
## # A tibble: 6 × 4
##   Country        Happiness GDPpc      Pop
##   <chr>              <dbl> <dbl>    <dbl>
## 1 Afghanistan         2.4   1971 38972236
## 2 Albania             5.2  13192  2866850
## 3 Algeria             5.12 10735 43451668
## 4 American Samoa     NA       NA    46216
## 5 Andorra            NA       NA    77723
## 6 Angola             NA     6110 33428490

Let’s not worry about plotting the data, and go straight to the correlation:
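
The code chunk behind the output below isn’t shown in these notes; a minimal sketch of the call would look something like this (the HappyData object name is my assumption; the column names are from the summary above):

# Pearson correlation test between GDP per capita and happiness
# (HappyData is an assumed name, not taken from the original chunk)
x <- HappyData$GDPpc
y <- HappyData$Happiness
cor.test(x, y, alternative = "two.sided", method = "pearson")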

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 13.502, df = 146, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6636288 0.8092331
## sample estimates:
##      cor 
## 0.745184

Here are our results. The estimate is a correlation, and we test that using the t statistic. The t-value is simply the estimate divided by the standard error (which we can’t see in this output), and is interpreted essentially as ‘how far from 0 is the estimate, in standard errors’.
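
Although the standard error isn’t shown, the t statistic for a correlation can be recovered from the correlation itself and the degrees of freedom. A quick sanity check in R, using the numbers from the output above:

# t statistic for a correlation: t = r * sqrt(df) / sqrt(1 - r^2)
r  <- 0.745184   # sample correlation, from the output above
df <- 146        # degrees of freedom, from the output above
r * sqrt(df) / sqrt(1 - r^2)   # roughly 13.5, matching the reported t-value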

The p-value for t is very very small, and obviously less than 0.05.

Conclusion: reject the null hypothesis, accept the alternative hypothesis (as always, pending better evidence).

Importantly, this does not mean that the true correlation in the population is 0.745, simply that it is very unlikely to be zero.

We can then look at our estimate of 0.745, and - even better - our confidence interval (see Section 8.2 for information on how to interpret confidence intervals), to gain some indication of the likely true correlation in the population.
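
If you want to work with these numbers rather than read them off the console, the estimate, confidence interval, and p-value can all be extracted from the test object; a minimal sketch (again, the HappyData name is my assumption):

# Store the test result and pull out the pieces we interpret
ct <- cor.test(HappyData$GDPpc, HappyData$Happiness)
ct$estimate   # the sample correlation (about 0.745)
ct$conf.int   # the 95 percent confidence interval
ct$p.value    # the p-value for the t test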

9.2 Regression Significance Tests

The process for assessing the significance of regression estimates is very similar to that for correlations. Let’s revisit the heart disease data set we used earlier.

##       ...1           biking          smoking        heart.disease    
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535
## # A tibble: 6 × 4
##    ...1 biking smoking heart.disease
##   <dbl>  <dbl>   <dbl>         <dbl>
## 1     1  30.8    10.9          11.8 
## 2     2  65.1     2.22          2.85
## 3     3   1.96   17.6          17.2 
## 4     4  44.8     2.80          6.82
## 5     5  69.4    16.0           4.06
## 6     6  54.4    29.3           9.55

Let’s go straight to the multiple regression model.
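
The model formula and data set are shown in the Call line of the output; fitting the model and printing its summary looks like this:

# Multiple regression: heart disease rate predicted by biking and smoking rates
model <- lm(heart.disease ~ biking + smoking, data = Heart)
summary(model)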

## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = Heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

We interpret these just as we did the correlation significance tests.

The t-value is large, and the p-value (two-tailed) is small.
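
As with the correlation, each t-value is simply the estimate divided by its standard error. Taking the biking coefficient from the output above as an example:

# t-value = estimate / standard error, biking row of the coefficients table
-0.200133 / 0.001366   # about -146.5, matching the printed t value up to rounding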

Interestingly, here we are given ‘stars’ for the different levels of significance, so to some extent the software is doing some of the decision making for you. To be honest, I always caution against relying solely on looking for ‘stars’ (it’s actually a bit of a running joke that I once told an entire class in the 1990s to ‘just look at the stars’). That’s because the significant-or-not decision depends on the critical value you chose, and on whether your test is one- or two-tailed. The software typically assumes a critical value of 0.05 for p, two-tailed, and calculates the ‘stars’ on that basis. Sometimes that can conflict with the decision you have made yourself about what should count as significant, and that can trip you up if you didn’t know to change these settings in the software package.
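
If you would rather make that decision yourself than lean on the stars, the two-tailed p-value can be computed directly from the t statistic and the residual degrees of freedom (both taken from the output above); for a one-tailed test you would drop the factor of two:

# Two-tailed p-value for the smoking coefficient, computed by hand
t_val <- 50.39   # t value from the coefficients table
df    <- 495     # residual degrees of freedom
2 * pt(abs(t_val), df = df, lower.tail = FALSE)   # effectively zero, i.e. < 2.2e-16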

Further, it also sort of entrenches the idea that things can be ‘more’ or ‘less’ statistically significant. Take a look at the results above and you’ll see that 3 stars marks a p-value below 0.001, 2 stars marks one below 0.01, boring old ‘0.05’ only gets a single star, and ‘0.1’ gets a dot. I’m not a fan here because this encourages the analyst to make post-hoc decisions about ‘marginally’ significant, or ‘very’ significant, results. These concepts do not exist. You decide your critical value, and you either pass or fail it.

Am I perfect? No. Specifically, do any of my papers use the language of ‘marginal’ significance? Sure, I bet you could find it. I am never a fan though, and I can promise you I argued about it at the time!