Chapter 5 Introduction to Bootstrapping

This chapter supplements the material in the first part of Lecture 7.

In this chapter, we will learn the core concepts of bootstrapping: creating synthetic sampling distributions by repeatedly resampling (with replacement) from a single sample.

The basic process is fairly simple once you have your original sample, and it has the following characteristics (a small code sketch follows the list):

  1. A bootstrap sample has an equal probability of randomly drawing any of the original sample elements (data points).
  2. Each element can be selected more than once, because the sampling is done with replacement.
  3. Each resampled data set (the new sample) is the same size as the original one.
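To make the idea concrete, here is a tiny base R sketch (not the chapter's own code) of what sampling with replacement looks like for a made-up sample of five values:

```r
x <- c(2, 5, 7, 9, 11)   # a made-up "original sample" of five values

# One bootstrap resample: same size as x, drawn with replacement,
# so some values can appear more than once and others not at all
sample(x, size = length(x), replace = TRUE)
```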

First, I will demonstrate the basic principle.

Recall from the last chapter that there was a simulated data set of 498 people, with variables representing smoking, biking, and heart disease.

We treated this as the population and then sampled from it to demonstrate uncertainty at different sample sizes.

So, let’s take the sample of 50 that we drew from that ‘population’ of 498, and imagine that we obtained it by (for example) running a survey of that population; this is now our data set for analysis.

First, we will remind ourselves of the properties of our sample: its median, mean, and distribution:

## [1] 17.4127
## [1] 16.46211
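For reference, numbers like these can be computed directly from the sample. A minimal sketch, assuming (as the boot output later in this chapter suggests) that the sample of 50 is stored as sub.50 and that the variable of interest is smoking:

```r
median(sub.50$smoking)   # first value above
mean(sub.50$smoking)     # second value above
hist(sub.50$smoking)     # the distribution
```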

Ok, there we go. Remember, this sample is our ‘data set’ for analysis.

We know the median and mean, but we have no real indication of the uncertainty in those estimates. In order to get that information, we will eventually use the bootstrap method. For now, we are just going to demonstrate the basic idea.

So, what we do now is draw another random sample of 50 from this 50, but each time we draw a data point, we put it back, so we are always drawing from the full 50. This is called sampling with replacement.

In this way, the new sample can only contain values that were in the original sample, but it can contain them with different frequencies: each value may appear more or fewer times than it did originally. The distribution of values in the new sample will therefore differ from that of the original sample, and so will its statistics.

Let’s draw this new sample, take the median and mean of the sample, and plot it:
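The exact code used to produce the output below is not shown, but a resample of this kind could be drawn along the following lines (a sketch; the object name sub.50 is taken from the boot calls later in the chapter):

```r
# Draw 50 row numbers from 1:50 with replacement, then keep those rows
idx <- sample(nrow(sub.50), size = nrow(sub.50), replace = TRUE)
resample.1 <- sub.50[idx, ]

resample.1                    # the new sample (some rows will repeat)
median(resample.1$smoking)
mean(resample.1$smoking)
```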

## # A tibble: 50 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1   301  47.4    15.3           9.10
##  2    86  15.1    16.6          14.9 
##  3   430   6.32    5.21         14.8 
##  4     3   1.96   17.6          17.2 
##  5    47  20.8    19.5          14.4 
##  6   482  33.3     6.81         10.4 
##  7   413  30.1    29.3          15.0 
##  8   469  11.7    26.7          17.5 
##  9   223  49.7     3.49          5.07
## 10   342  33.3    17.3          11.3 
## # ℹ 40 more rows
## [1] 16.96012
## [1] 15.56018

Marvelous! Now, for the purpose of example, let’s draw two more of these resamples from the original 50, take their median and mean, and plot the distributions…

## # A tibble: 50 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1   399  71.7    0.942         0.555
##  2   287  36.2   20.9          12.8  
##  3   122  46.6   26.9          10.5  
##  4   307  17.8    3.48         11.4  
##  5   109  49.7   20.5           8.70 
##  6   430   6.32   5.21         14.8  
##  7   346   1.58   9.06         14.4  
##  8   355  70.0   20.3           3.69 
##  9   217  23.1    0.907        10.7  
## 10   100  12.6   17.0          14.9  
## # ℹ 40 more rows
## [1] 15.22463
## [1] 14.43376

## # A tibble: 50 × 4
##     ...1 biking smoking heart.disease
##    <dbl>  <dbl>   <dbl>         <dbl>
##  1   342  33.3    17.3          11.3 
##  2   346   1.58    9.06         14.4 
##  3   332  51.6    23.3           8.55
##  4   332  51.6    23.3           8.55
##  5   223  49.7     3.49          5.07
##  6   333  74.0    18.9           3.30
##  7   430   6.32    5.21         14.8 
##  8    47  20.8    19.5          14.4 
##  9   283  67.8    26.9           6.26
## 10   430   6.32    5.21         14.8 
## # ℹ 40 more rows
## [1] 16.80249
## [1] 16.43715

Now, if we return to the slides, we can build a table using these mean and median values. Of course, the slide deck will have slightly different values, since it’s based on a different run of the resampling process, but the principle is the same.

So, this is the basic idea of bootstrapping: we sample with replacement from our original sample, many times over. We did three resamples here manually, but in practice we use a program to do this far more often, typically a thousand times or more; a minimal sketch of such a loop follows.
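Assuming, as before, that the sample of 50 is stored as sub.50 with a smoking column, the loop might look like this:

```r
set.seed(123)   # hypothetical seed, purely for reproducibility

# 1000 bootstrap means of smoking from the sample of 50
boot.means <- replicate(1000, {
  idx <- sample(nrow(sub.50), replace = TRUE)
  mean(sub.50$smoking[idx])
})

hist(boot.means)                         # the synthetic sampling distribution
sd(boot.means)                           # bootstrap standard error of the mean
quantile(boot.means, c(0.025, 0.975))    # a simple percentile interval
```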

5.1 Bootstrapping in the Context of Previous Examples

To further reinforce the point, let’s now place ourselves in the position of three different researchers, each with a different level of enthusiasm, all researching the same population of 498 people that we explored in the last few examples.

Researcher 1 is a little like me as a Ph.D. student, and maybe more interested in ‘experiencing life’. So, he has little time to actually collect data, and not much more enthusiasm for it. In the end, he manages to take a sample of 10 people from the population of 498.

Researcher 2 is a bit more enthusiastic, and gets a sample of 50.

Researcher 3 is fairly conscientious, and takes a sample of 200 from the population of 498.

Now, what we can do is run 1000 bootstrap replications on each of these varying-sized subsamples of the population, to see what might happen:

First, the 10:

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = sub.10, statistic = f1, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 16.98566 -0.01258974    2.253984

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (12.58, 21.42 )  
## Calculations and Intervals on Original Scale
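As the Call: lines show, this output comes from the boot package. The code below is a sketch of how output like this could be produced; the body of f1 is an assumption (it is taken to return the mean of the smoking variable for the rows selected in each replicate):

```r
library(boot)

# Statistic: mean of smoking, computed on the rows chosen in each bootstrap replicate
f1 <- function(data, indices) {
  mean(data$smoking[indices])
}

results <- boot(data = sub.10, statistic = f1, R = 1000)
results                            # original estimate, bias, and std. error
boot.ci(results, type = "norm")    # normal-approximation 95% CI
```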

Now let’s do the same for the other two subsamples, n = 50 and n = 200:

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = sub.50, statistic = f1, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 16.46211 0.07765874    1.180578

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (14.07, 18.70 )  
## Calculations and Intervals on Original Scale
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = sub.200, statistic = f1, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 15.78003 0.005737106   0.5777123

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (14.64, 16.91 )  
## Calculations and Intervals on Original Scale

We’ll now build a table with these values back in the slide deck. Again, remember that the values in the slide deck will differ from these due to the randomness of the process.

Now, let’s shift our minds a bit, and consider that the data set of 498 actually represents a sample of a larger population (remember from the last chapter, it’s simulated, but meant to represent a sample from the population).

So, let’s bring in Researcher 4, the most conscientious of all. She is the one who manages to take a sample of 498 people from the population. And, finally, we can bootstrap the original full sample of 498:

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Heart, statistic = f1, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 15.43503 0.001437866    0.359051

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (14.73, 16.14 )  
## Calculations and Intervals on Original Scale

This is a very nice set of results, which can tell us many interesting things. So let’s go back to the slides…

5.2 Bootstrapping Other Stuff…

We have so far only bootstrapped the mean. However, the basic principle can be applied to virtually any statistical estimate. So, we can revisit some of our prior analyses, and use the bootstrap method to quantify the uncertainty in the estimates that we previously accepted without really thinking too hard about them.

5.2.1 Correlations

First, let’s revisit our recent correlation analysis of Happiness and GDP per capita.

## # A tibble: 6 × 4
##   Country        Happiness GDPpc      Pop
##   <chr>              <dbl> <dbl>    <dbl>
## 1 Afghanistan         2.4   1971 38972236
## 2 Albania             5.2  13192  2866850
## 3 Algeria             5.12 10735 43451668
## 4 American Samoa     NA       NA    46216
## 5 Andorra            NA       NA    77723
## 6 Angola             NA     6110 33428490
##           vars   n        mean           sd     median     trimmed        mad
## Country*     1 249      125.00        72.02     125.00      125.00      91.92
## Happiness    2 153        5.49         1.12       5.53        5.52       1.16
## GDPpc        3 197    20463.88     20717.34   12655.00    17037.01   13338.95
## Pop          4 242 59178643.60 331869505.09 5596196.00 12318073.38 8185922.38
##             min          max        range  skew kurtosis          se
## Country*    1.0 2.490000e+02 2.480000e+02  0.00    -1.21        4.56
## Happiness   2.4 7.820000e+00 5.420000e+00 -0.26    -0.38        0.09
## GDPpc     731.0 1.125570e+05 1.118260e+05  1.58     2.55     1476.05
## Pop       809.0 4.663087e+09 4.663086e+09 11.65   152.44 21333379.77

If we run the same analysis as in Chapter 2, we’ll get the same results: correlation R = 0.75.

Now, let’s take uncertainty into account, by bootstrapping that correlation and creating some confidence intervals.

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.689    0.811
## Response: Happiness (numeric)
## Explanatory: GDPpc (numeric)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.745

So, you can see the correlation is 0.75, with a 95% confidence interval of 0.69 to 0.81.
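The layout of the output above (Response/Explanatory lines, a stat column, lower_ci and upper_ci) is typical of the infer package. A sketch of one way to produce such an interval is shown below; the data frame name happiness_data is a placeholder, and the filtering step is an assumption about how missing values were handled:

```r
library(dplyr)
library(infer)

# Keep only countries with both Happiness and GDPpc observed (assumption)
happy_complete <- happiness_data %>%
  filter(!is.na(Happiness), !is.na(GDPpc))

# Observed correlation
obs_cor <- happy_complete %>%
  specify(Happiness ~ GDPpc) %>%
  calculate(stat = "correlation")

# Bootstrap distribution of the correlation, then a 95% percentile interval
boot_cor <- happy_complete %>%
  specify(Happiness ~ GDPpc) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")

get_confidence_interval(boot_cor, level = 0.95, type = "percentile")
obs_cor
```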

Now, let’s extend this to the multiple regression case we have previously used, examining the relationships between smoking, biking, and heart disease.

## # A tibble: 6 × 4
##    ...1 biking smoking heart.disease
##   <dbl>  <dbl>   <dbl>         <dbl>
## 1     1  30.8    10.9          11.8 
## 2     2  65.1     2.22          2.85
## 3     3   1.96   17.6          17.2 
## 4     4  44.8     2.80          6.82
## 5     5  69.4    16.0           4.06
## 6     6  54.4    29.3           9.55
##               vars   n   mean     sd median trimmed    mad  min    max  range
## ...1             1 498 249.50 143.90 249.50  249.50 184.58 1.00 498.00 497.00
## biking           2 498  37.79  21.48  35.82   37.71  27.51 1.12  74.91  73.79
## smoking          3 498  15.44   8.29  15.81   15.47  10.86 0.53  29.95  29.42
## heart.disease    4 498  10.17   4.57  10.39   10.18   5.42 0.55  20.45  19.90
##                skew kurtosis   se
## ...1           0.00    -1.21 6.45
## biking         0.07    -1.22 0.96
## smoking       -0.04    -1.12 0.37
## heart.disease -0.03    -0.93 0.20

Here, we need to calculate multiple confidence intervals, since we have multiple estimates (one per regression coefficient).

## # A tibble: 3 × 2
##   term      estimate
##   <chr>        <dbl>
## 1 intercept   15.0  
## 2 smoking      0.178
## 3 biking      -0.200
## # A tibble: 3 × 3
##   term      lower_ci upper_ci
##   <chr>        <dbl>    <dbl>
## 1 biking      -0.203   -0.197
## 2 intercept   14.8     15.1  
## 3 smoking      0.171    0.186
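Intervals like these can be produced in more than one way; the sketch below uses the boot package (not necessarily the approach used here), refitting the regression on each resample and returning its coefficients:

```r
library(boot)

# Statistic: refit the regression on the resampled rows and return its coefficients
coef_fun <- function(data, indices) {
  coef(lm(heart.disease ~ smoking + biking, data = data[indices, ]))
}

boot_coefs <- boot(data = Heart, statistic = coef_fun, R = 1000)

# One interval per coefficient: index 1 = intercept, 2 = smoking, 3 = biking
boot.ci(boot_coefs, type = "perc", index = 2)   # smoking
boot.ci(boot_coefs, type = "perc", index = 3)   # biking
```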

It’s worth reflecting on exactly what these confidence intervals mean, and to do so, we can move back to the slides…