Chapter 9 Classical Statistical Hypothesis Testing
This chapter relates to Lecture 10, and contains some (re)worked examples from earlier sessions, focusing specifically on their hypothesis-testing aspects.
9.1 Correlation Significance Tests
Here, we revisit the correlation between GDP per capita and happiness metrics, which I pulled from Our World in Data.
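The summary and preview below come from summary() and head(). Here is a minimal sketch of the kind of code involved; the object name Happy and the file name are my assumptions, not the originals:

library(tidyverse)

Happy <- read_csv("happiness_gdp.csv")  # hypothetical file name

summary(Happy)  # produces the summary table below
head(Happy)     # produces the tibble preview below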
##    Country            Happiness         GDPpc             Pop
##  Length:249         Min.   :2.400   Min.   :   731   Min.   :8.090e+02
##  Class :character   1st Qu.:4.670   1st Qu.:  4917   1st Qu.:4.153e+05
##  Mode  :character   Median :5.530   Median : 12655   Median :5.596e+06
##                     Mean   :5.492   Mean   : 20464   Mean   :5.918e+07
##                     3rd Qu.:6.260   3rd Qu.: 30100   3rd Qu.:2.421e+07
##                     Max.   :7.820   Max.   :112557   Max.   :4.663e+09
##                     NA's   :96      NA's   :52       NA's   :7
## # A tibble: 6 × 4
##   Country        Happiness GDPpc      Pop
##   <chr>              <dbl> <dbl>    <dbl>
## 1 Afghanistan         2.4   1971 38972236
## 2 Albania             5.2  13192  2866850
## 3 Algeria             5.12 10735 43451668
## 4 American Samoa     NA       NA    46216
## 5 Andorra            NA       NA    77723
## 6 Angola              NA     6110 33428490
Let’s not worry about plotting the data, and go straight to the correlation:
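The test itself is R's cor.test(); the names x and y match the 'data: x and y' line in the output. A sketch, assuming the raw columns of the (hypothetical) Happy tibble are used:

x <- Happy$GDPpc
y <- Happy$Happiness
cor.test(x, y)  # defaults: Pearson, two-sided, 95% CI; incomplete pairs are dropped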
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.502, df = 146, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6636288 0.8092331
## sample estimates:
## cor
## 0.745184
Here are our results. The estimate is a correlation, and we test it using the t statistic. The t-value is simply the estimate divided by its standard error (which we can't see in this output), and is interpreted essentially as 'how far from 0 is the estimate, in standard errors'.
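We can, however, reconstruct that standard error, since for a correlation it is sqrt((1 - r^2) / (n - 2)). A quick check using the numbers in the output:

r  <- 0.745184              # the sample correlation from the output
df <- 146                   # n - 2, from the output
se <- sqrt((1 - r^2) / df)  # standard error of r under the null
r / se                      # 13.502, the t-value reported above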
The p-value for t is very very small, and obviously less than 0.05.
Conclusion: reject the null hypothesis and accept the alternative (as always, pending better evidence).
Importantly, this does not mean that the true correlation in the population is 0.745; it simply means that a correlation this large would be very unlikely if the true value were zero.
We can then look at our estimate of 0.745, and - even better - our confidence interval (see Section 8.2 for information on how to interpret confidence intervals), to gain some indication of the likely true correlation in the population.
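For the curious, cor.test() builds that interval using Fisher's z-transformation, and we can reconstruct it from the numbers in the output:

r  <- 0.745184
n  <- 148                               # because df = n - 2 = 146
z  <- atanh(r)                          # Fisher's z
se <- 1 / sqrt(n - 3)
tanh(z + c(-1, 1) * qnorm(0.975) * se)  # 0.664 to 0.809, matching the output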
9.2 Regression Significance Tests
The process to assess the significance of regression estimates is very very similar to that for correlations. Let's revisit the heart disease data set we used earlier.
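Again, a sketch of the calls that produce the output below; the file name is my assumption, but the object name Heart matches the data argument in the lm() call further down:

Heart <- read_csv("heart.data.csv")  # hypothetical file name

summary(Heart)
head(Heart)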
##       ...1           biking          smoking        heart.disease
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535
## # A tibble: 6 × 4
##    ...1 biking smoking heart.disease
##   <dbl>  <dbl>   <dbl>         <dbl>
## 1     1  30.8    10.9          11.8
## 2     2  65.1     2.22          2.85
## 3     3   1.96   17.6          17.2
## 4     4  44.8     2.80          6.82
## 5     5  69.4    16.0           4.06
## 6     6  54.4    29.3           9.55
Let's go straight to the multiple regression model.
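The model formula and data set name are confirmed by the Call: line in the output; only the object name heart_model is my invention:

heart_model <- lm(heart.disease ~ biking + smoking, data = Heart)
summary(heart_model)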
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = Heart)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.1789 -0.4463  0.0362  0.4422  1.9331
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
We interpret these just as we did the correlation significance tests.
Each t-value is large, and each p-value (two-tailed) is small.
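As with the correlation, each t-value is just the estimate divided by its standard error. For example, for biking:

-0.200133 / 0.001366  # about -146.5; the table's -146.53 uses the unrounded SE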
Interestingly, here we are given 'stars' for the different levels of significance, so to some extent the software is doing some decision making for you. To be honest, I always caution against relying solely on looking for 'stars' (it's actually a bit of a running joke that I once told an entire class in the 1990s to 'just look at the stars'). That's because the significant-or-not decision depends on the critical value you chose, and on whether your test is one- or two-tailed. The software typically assumes a two-tailed test and fixed cut-offs for p (0.05 and so on), and calculates the 'stars' on that basis. That can conflict with the decision rule you set for yourself, and it can trip you up if you didn't know to change those defaults in the software package.
Further, it also sort of entrenches the idea that things can be 'more' or 'less' statistically significant. Take a look at the legend in the results above: three stars means p below 0.001, two stars below 0.01, boring old 0.05 only gets a single star, and 0.1 gets a dot. I'm not a fan here, because this encourages the analyst to make post-hoc decisions about 'marginally' significant or 'very' significant results. These concepts do not exist. You decide your critical value, and you either pass or fail it.
Am I perfect? No. Specifically, do any of my papers use the language of 'marginal' significance? Sure, I bet you could find it. I was never a fan though, and I can promise you I argued about it at the time!
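Incidentally, if like me you'd rather not see the stars at all, base R lets you switch them off:

options(show.signif.stars = FALSE)  # option read by R's printCoefmat()
summary(heart_model)                # same table as above, minus the stars
                                    # (heart_model is the hypothetical object from the earlier sketch)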