Chapter 4 Beginning to Understand Uncertainty

This chapter relates to Lecture 6.

In this chapter I’ll introduce core concepts around uncertainty in our results. Understanding that the results of our analyses always contain some level of uncertainty is probably the most critical concept for us to get our heads around as quantitative social scientists. Most of our job is not really about computing the actual statistics, such as a correlation coefficient or a regression beta, but about understanding how to interpret and use those results - i.e. what they mean. And fundamental to that is understanding their uncertainty.

Again, to reiterate the message I gave in class, many of the examples in this chapter, and in later ones, involve randomness. This means that the results here may be numerically slightly different from the results in the slides. And if you were to run these examples yourself, you would also get slightly different results. This is nothing to worry about, because the meaning of the results does not change.
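
If you would like your own runs to be repeatable from one session to the next, one option is to fix R’s random seed before doing any sampling. A minimal sketch, where the seed value is an arbitrary choice of mine and not the one behind the output in this chapter:

```r
# Fixing the seed makes subsequent 'random' draws repeatable on your machine.
# 1234 is an arbitrary placeholder, not the seed used for the output below.
set.seed(1234)
```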

So, to start the journey, let’s grab some data.

Here, we will again use the simple three-variable set of simulated data, which represents rates of smoking, rates of cycling, and heart disease incidence.
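
The code below is a sketch of how the data might be read in and checked; the file name heart.data.csv and the object name heart are my assumed placeholders, not necessarily the exact ones behind the output shown.

```r
library(tidyverse)

# Read the simulated data (the file name here is an assumed placeholder).
heart <- read_csv("heart.data.csv")

# Peek at the first few rows to check the variables look as expected.
head(heart)
```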

## # A tibble: 6 × 4
##    ...1 biking smoking heart.disease
##   <dbl>  <dbl>   <dbl>         <dbl>
## 1     1  30.8    10.9          11.8 
## 2     2  65.1     2.22          2.85
## 3     3   1.96   17.6          17.2 
## 4     4  44.8     2.80          6.82
## 5     5  69.4    16.0           4.06
## 6     6  54.4    29.3           9.55

Rather than do the full ‘describe’ as I did in the last chapter, above I have simply looked at what is called the ‘head’ of the data set, that is, the first few rows. This is because all I want to do here is double-check that I have the data and see which variables it contains.

Let’s calculate some simple summary statistics from this data set to build on. For example, what are the mean and median for ‘smoking’?
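
The output below is what you would get from something like the following, assuming the data object is called heart as above (summary() also gives the quartiles and the minimum and maximum for free):

```r
# Minimum, quartiles, median, mean, and maximum for the smoking variable.
summary(heart$smoking)
```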

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5259  8.2798 15.8146 15.4350 22.5689 29.9467

Now, we know this is really simulated data, but let’s imagine for now that it was actually obtained by an organization like the Office for National Statistics in the UK, using a survey. We can presume the study was done well, and thus it is based on a true random sampling method, and we assume that the study population matches whatever target population we have in mind (remember the ‘inference gaps’ discussed in class).

What we really want to know is, how close are these statistics (i.e. the mean and median) to the true population values that we would have found if we could survey the entire target population?

Let’s begin to think about this by building a table of these statistics; to do so, let’s go back to the slide deck…

4.1 Demonstration: Sampling from a ‘Known’ Population

Now, let’s go back one more step, and demonstrate the uncertainty inherent to sample statistics by way of example.

Let’s now assume that this sample of 498 people actually is the population we are interested in.

What this means is, we can actually draw a sample from this population of 498 and see what happens.

First, let’s present the distribution for the entire ‘population’ of 498.
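
One way to draw that distribution is a simple histogram of smoking across all 498 cases; a sketch using ggplot2, where the bin width is an arbitrary choice of mine:

```r
# Histogram of smoking rates for the whole 'population' of 498 cases.
ggplot(heart, aes(x = smoking)) +
  geom_histogram(binwidth = 2, colour = "white") +
  labs(x = "Smoking rate", y = "Count",
       title = "Smoking in the 'population' (n = 498)")
```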

Now, let’s literally take a sample of 10 random cases from that population of 498. Here, we are sampling without replacement, and are thus doing essentially what a hypothetical ‘researcher’ would do if they drew, from the population of 498, a random sample of 10 people to complete their survey.
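
A sketch of how that draw could be done with dplyr is below; slice_sample() samples rows without replacement by default, and since the seed behind the printed output is not fixed here, your ten rows will differ from mine:

```r
# Draw a simple random sample of 10 rows (without replacement, the default).
sample_10 <- heart %>% slice_sample(n = 10)
sample_10
```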

## # A tibble: 10 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1    18  33.9     5.76          9.16
##  2   405  12.9    27.0          18.6 
##  3   275  32.2     5.02          9.90
##  4   288  45.5    25.8          10.9 
##  5   111  15.2    14.5          14.6 
##  6   256  73.1    11.3           1.69
##  7   241   6.28   20.2          17.6 
##  8    92  26.9    16.7          13.5 
##  9   248  29.7    23.3          13.0 
## 10   355  70.0    20.3           3.69

Next, let’s look at the relevant statistics (median and then mean) and distribution of this sample of 10:
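
The two numbers below are the median and then the mean of smoking in that sample of 10, produced by something like:

```r
# Median first, then mean, of smoking in the sample of 10.
median(sample_10$smoking)
mean(sample_10$smoking)
```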

## [1] 18.44737
## [1] 16.98566

We can do the same for successively larger samples, say 50 and 200:
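
A sketch of the same steps for the larger samples follows; again, your draws will differ from the ones printed below:

```r
# Sample of 50: draw, print, then median and mean of smoking.
sample_50 <- heart %>% slice_sample(n = 50)
sample_50
median(sample_50$smoking)
mean(sample_50$smoking)

# Sample of 200: the same again.
sample_200 <- heart %>% slice_sample(n = 200)
sample_200
median(sample_200$smoking)
mean(sample_200$smoking)
```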

## # A tibble: 50 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1   355  70.0   20.3           3.69 
##  2   109  49.7   20.5           8.70 
##  3   399  71.7    0.942         0.555
##  4   135  70.3   26.2           5.24 
##  5   218  65.1    1.49          2.19 
##  6   346   1.58   9.06         14.4  
##  7   369  39.9   14.5          10.0  
##  8   323  56.4   17.5           6.81 
##  9   341  46.0   18.1           9.23 
## 10    47  20.8   19.5          14.4  
## # ℹ 40 more rows
## [1] 17.4127
## [1] 16.46211

## # A tibble: 200 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1    65   42.3    8.41          7.46
##  2    48   46.6    9.25          6.81
##  3   425   21.8   15.5          13.4 
##  4   374   61.5   17.0           5.91
##  5   467   74.5   22.5           4.30
##  6   161   30.5    2.53          8.94
##  7   408   50.8   20.0           7.93
##  8   417   45.9    2.62          5.81
##  9    27   60.5    3.98          3.22
## 10   336   14.7   19.5          14.4 
## # ℹ 190 more rows
## [1] 15.98487
## [1] 15.98487

As you can see, the distributions of the smaller samples are more peaky and bumpy, because they are very sensitive to individual data points. As the sample gets larger, its distribution starts to look more and more like the population’s, right?

We can now complete the table of sample statistics (median and mean) in the slides, showing that, in general, as the sample size gets closer to the population size, the sample statistics get closer to the population values too. To do so, let’s go back to the slides…