Learning objectives, technical skills and concepts:

  • Bootstrap
  • Apply functions
  • Building custom functions
  • Resampling

Bootstrap

We’re going to take a brief detour to continue our focus on resampling methods.

In this week’s lab, we’ll work through materials covering two applications of bootstrap methods.

The material on bootstrapping techniques in this section was adapted from laboratory materials and data for ECo 634, Analysis of Environmental Data Lab, written by Kevin McGarigal at the University of Massachusetts Amherst Department of Environmental Conservation.

The apply() family of functions

The apply philosophy

There exists a strong anti-loop school of thought within the R community.

The apply functions attempt to replace the need for for-loops by providing a function-based set of alternatives.

apply()

To demonstrate the apply philosophy, I’ll use the apply() function.

apply() works on data frames or other 2-dimensional arrays of data, such as matrices.

It applies a function to either the rows or columns of the input data.

The important arguments are:

  • X the 2-dimensional data, usually a data frame or a matrix
  • MARGIN whether to apply the function to rows (MARGIN = 1) or columns (MARGIN = 2)
  • FUN The function to apply to the rows or columns

Basic examples

Here are some demos using a matrix:

# Create simulated data
dat = matrix(1:49, nrow = 7, ncol = 7)
print(dat)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    8   15   22   29   36   43
## [2,]    2    9   16   23   30   37   44
## [3,]    3   10   17   24   31   38   45
## [4,]    4   11   18   25   32   39   46
## [5,]    5   12   19   26   33   40   47
## [6,]    6   13   20   27   34   41   48
## [7,]    7   14   21   28   35   42   49

Minimum and maximum values in each row

apply(dat, MARGIN = 1, FUN = min)
## [1] 1 2 3 4 5 6 7
apply(dat, MARGIN = 1, FUN = max)
## [1] 43 44 45 46 47 48 49


  • Note that I used MARGIN = 1 to indicate that I wanted apply() to look at rows.
  • I used FUN = min and FUN = max to tell apply() to find the minimum and maximum values in the rows.
  • I did not surround the function names by quotes.


Mean values in each column

apply(dat, MARGIN = 2, FUN = mean)
## [1]  4 11 18 25 32 39 46


  • Note that I used MARGIN = 2 to indicate that I wanted apply() to look at columns.
  • I used FUN = mean to tell apply() to find the mean value in each column.


More advanced usage

You can use the apply family of functions to accomplish most tasks that you could use loops for.

I find that sometimes apply functions are simpler, while sometimes it is easier to use a for-loop. It often depends on the complexity of the task.

  • If the task requires several steps, and perhaps some intermediate variables, a for-loop is often better.
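
For example, here is the same simple task (row sums of the dat matrix from above) done both ways; this is just a quick sketch, and row_sums is a name I chose for the demo:

# Row sums using apply(): a single line
apply(dat, MARGIN = 1, FUN = sum)

# The same calculation with a for-loop: it needs an
# intermediate results vector and an index variable
row_sums = numeric(nrow(dat))
for(i in 1:nrow(dat))
{
  row_sums[i] = sum(dat[i, ])
}
row_sums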

Here’s a DataCamp tutorial on apply functions you can check out for some more examples:

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family

Data Files

We’ll work with a dataset containing standardized abundance estimates of 10 rare moth species across 24 sample plots in the pine barrens of southeast Massachusetts.

You’ll need the moths.csv file for the data.

require(here)
moths = read.csv(here("data", "moths.csv"))
head(moths)
##   site itame mepi cime  anst cahe  bare   acal   apde psca euma
## 1    4     0    0    0 1.000  1.0 0.000  0.667  2.500  0.0 0.00
## 2    5     0    0    0 1.000  0.0 0.000  0.000  0.000  1.5 0.00
## 3    6     0    2    0 0.000  2.5 0.000  0.667 10.000  0.0 0.00
## 4    7     0    0    3 5.000  0.0 1.667  0.500  2.333  1.5 0.00
## 5    8     0    0    0 0.000  4.5 0.000  0.667  4.000  3.0 0.75
## 6    9     5    0    0 6.333  1.0 1.667 16.667  1.333  7.0 2.75

The Bootstrap

In a statistical context, bootstrap refers to a resampling technique in which we use sampling with replacement on a single sample of observations to simulate the process of taking many samples from a population.

The term bootstrap comes from the old saying about ‘pulling yourself up by your own bootstraps’.

The phrase means getting ‘something for nothing’.

Bootstrap and Monte-Carlo resampling

These two resampling techniques can help us learn something about the null and alternative hypotheses.

Bootstrapping creates a kind of alternative distribution, while Monte-Carlo resampling can characterize a null distribution.

  • The alternative distribution corresponds to a range of possibilities given associations in the data we actually observed.
  • The null distribution tells us something about what we could expect to see if there were no associations.

Bootstrap resampling does not break the associations in the data.

This is because in bootstrap resampling, in contrast to Monte-Carlo resampling, we sample entire rows of data.

One way to think of the difference is that:

  • Monte-Carlo resampling shuffles each column separately, creating new combinations of attribute values that were not observed in the original data.
  • Bootstrapping samples an entire row; it does not create new data.
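
Here’s a minimal sketch of the difference, using the moths data frame read in above (the object names are just for illustration):

# Monte-Carlo style: shuffle each species column independently.
# This creates new row combinations and breaks the associations.
mc_resample = apply(moths[ , -1], 2, sample)

# Bootstrap style: sample whole rows with replacement.
# Rows are kept intact, so within-row associations are preserved.
boot_rows = sample(nrow(moths), replace = TRUE)
boot_resample = moths[boot_rows, -1]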

The Scenario

You want to know something about a larger population, so you have taken a single sample of n measurements.

  • You would like to be able to take many more samples, but you don’t have the resources for repeated sampling efforts
  • Although you have only a single sample, you could resample data from the sample in many ways if you use sampling with replacement.
  • Sampling with replacement allows some values to appear more than once, and other values to be left out.

Resampling with replacement:

  • You can draw observations at random from the original sample, replacing each observation after it is selected so that it has the same chance of being drawn on the next draw.
  • By doing this repeatedly, you can create a new data set by resampling the original data set.
  • It’s not quite as good as being able to resample the population, but it is a powerful technique nonetheless.
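
For example, a single resample of the numbers 1 through 5 with replacement will typically repeat some values and omit others:

# Draw 5 values from 1:5, with replacement
sample(1:5, size = 5, replace = TRUE)

# One possible result: 2 4 4 1 5  (4 appears twice; 3 is left out)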

One of the most important uses of the bootstrap is calculating non-parametric confidence intervals.

In this context, the bootstrap simulates the Frequentist ideal of obtaining estimates from repeated similar experiments.

Moth Abundance: CIs

To demonstrate bootstrap confidence intervals, we’ll work with a dataset containing standardized abundance estimates of 10 rare moth species across 24 sample plots in the pine barrens of southeast Massachusetts.

For now, let’s focus on just one species, the spiny oakworm, Anisota stigma, abbreviated ‘anst’, and examine the distribution of its abundance with a histogram:
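
A basic version of that histogram could be created like this (the breaks and labels are up to you):

hist(
  moths$anst,
  breaks = 10,
  xlab = "Standardized abundance",
  main = "Anisota stigma (anst)")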

The standardized abundances appear highly non-normal. Techniques that assume normality probably aren’t a good choice for these data.

Let’s say we want to obtain a 95% confidence interval for the mean standardized abundance. We’ll try both parametric and non-parametric approaches.

A Parametric Confidence Interval

Recall that the central limit theorem states that upon repeated sampling, the sample mean will be approximately normally distributed, subject to certain assumptions which may not be met here.

Recall also that we can think of the t distribution as a standard normal distribution that has been adjusted for a sample size.

  • The t-distribution has heavy tails compared to the normal. These tails represent the increased uncertainty when we’re working with a finite sample.
  • The t-distribution is (usually) described by a single parameter: the degrees of freedom (DF). The DF is just the sample size minus 1.

Using these facts, we can use a t-distribution to construct a parametric confidence interval.

Should I use a t-distribution or a Standard Normal Distribution for my CI calculations?

In general, it is always safer (and more conservative) to use the t-distribution.
A good rule of thumb is that if your sample has 30 or fewer individuals, you should use a t-distribution. For degrees of freedom greater than 30, the difference between a t-distribution and the Standard Normal Distribution is negligible.

  • For the purposes of the calculations in this lab, you should use a t-distribution.
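
As a quick check of that rule of thumb, you can compare the 97.5% critical values from the t-distribution and the Standard Normal; as the degrees of freedom grow past about 30, the two converge:

qnorm(0.975)         # Standard Normal critical value
qt(0.975, df = 9)    # small sample: noticeably larger than the Normal value
qt(0.975, df = 30)   # rule-of-thumb boundary: already quite close
qt(0.975, df = 100)  # large sample: nearly identical to the Normal value
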
General procedure to calculate a CI

To calculate a parametric CI for a sample, the general procedure is:

  1. Choose your confidence level (usually 95%).
    • The value of \(\alpha\) is just 1 minus the confidence level.
  2. Calculate the sample standard error.
    • You’ll need the sample standard deviation and the sample size to do this calculation.
  3. Using your chosen value of \(\alpha\), calculate the critical t-values.
  4. Multiply the sample standard error by the critical t-value. This is the radius of the CI.
  5. Express the CI as \(mean \pm radius\)

Calculating the Parametric CI

# Choose significance level
alpha = 0.05

# 2: Calculate sample standard error:
n = sum(!is.na(moths$anst))
sse = sd(moths$anst, na.rm = TRUE) / sqrt(n)

# 3: Calculate critical t-values:
t_crit = abs(qt(alpha / 2, df = n - 1))

# 4: Calculate the CI radius:
ci_radius = sse * t_crit

# The CI is the sample mean +/- the radius:
anst_ci = c(
  lower = mean(moths$anst, na.rm = TRUE) - ci_radius,
  upper = mean(moths$anst, na.rm = TRUE) + ci_radius)

print(round(anst_ci, 4))
##  lower  upper 
## 0.9358 4.0046

The CI:

technique            mean     CI radius   lower    upper
parametric: t-dist   2.4702   1.5344      0.9358   4.0046
  • So the parametric CI is \(2.47\pm1.53\)

This is the standard way to construct a confidence interval for the mean and test the hypothesis that the mean differs from 0.

If the confidence interval does not contain 0, then we can say that the mean is unlikely to have come from a distribution with a mean of 0. In other words, the population mean is likely to be different from zero.
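
For example, since abundances can’t be negative, we can verify that the interval we just computed excludes zero simply by checking its lower bound:

# TRUE when the lower CI bound is above zero, i.e. 0 is outside the interval
anst_ci["lower"] > 0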

A Simple Bootstrap Confidence Interval

If we don’t want to use a parametric approach, we can use bootstrapping.

The bootstrap CI would likely be a more robust estimate since we have a small sample size (n = 24) and we do not know whether the underlying population is normally-distributed.

We can perform a simple bootstrap simulation by calculating the mean abundance on many randomly resampled (with replacement) data sets.

Let’s try a set of 10000:

The procedure will be:

  1. Create an empty vector to hold the bootstrap sample means
  2. Create the resampled data sets and calculate the means
  3. Calculate the CI from the quantiles of the resulting bootstrap means

Calculating the Bootstrap CI

Create an empty results vector:

m = 10000

# numeric() creates a vector of length m with all values initialized to zero
result = numeric(m)
head(result)
## [1] 0 0 0 0 0 0

Perform the bootstrap

for(i in 1:m)
{
  result[i] = mean(sample(moths$anst, replace=TRUE))
}

Calculate the quantiles

  • The vector result now contains 10,000 bootstrap sample means. We can calculate the mean of the bootstrap means and, more importantly, the 2.5% and 97.5% quantiles of the bootstrap distribution, as follows:
mean(result)
## [1] 2.470567
quantile(result,c(0.025,0.975))
##     2.5%    97.5% 
## 1.211056 4.005499

Compare the CIs

How does this compare with parametric confidence interval?

technique            mean     CI radius   lower    upper
parametric: t-dist   2.4702   1.5344      0.9358   4.0046
simple boot          2.4862   NA          1.2500   4.0433

Close, but not identical.

Our bootstrap confidence intervals are skewed because the data are skewed, whereas the parametric confidence interval is symmetric (they usually are) because it’s based upon the sampling distribution (which is approximately Normal per the Central Limit Theorem).

Bootstrap Interval Using boot()

The boot package includes functions to perform bootstrap resampling.
You can install the package using:

install.packages("boot")

The basic syntax of boot is very simple:

require(boot)
boot(data, statistic, R)

A couple of things to note:

  • R is the number of resamplings you want to perform
  • data is the data object you want to resample. It can be a vector, matrix, or data.frame.
  • statistic is a function that returns the statistic of interest. You don’t put quotes around the name of the function.
  • There are restrictions on the order and types of arguments that this function must take.
  • We’ll usually need to write a custom function to meet the restrictions.
  • Be sure to check out the help entry. It’s a difficult read, so hopefully an example will illustrate the process:

Custom Mean Function

We want to calculate the mean, but we know that mean() can sometimes have trouble with NA values.

To use it with boot() we have to create a modified version of the mean() function that automatically excludes NAs:

boot_mean = function(x, i)
{
  return(mean(x[i], na.rm = TRUE))
}

The modified function is needed because:

  1. Per the help entry for boot(), the first argument to our statistic function has to be the input data.
    • In this case, our data are a vector of numbers. I’ve called the argument x in the custom function.
  2. The second argument is an index (a vector of subscripts) that is used within boot() to select random assortments of x.

The key point is that we have to use our boot_mean() function, rather than mean() within our call to boot().

Now we can find the bootstrap for 10000 iterations:

myboot = 
  boot(
    data = moths$anst,
    statistic = boot_mean,
    R = 10000)
print(myboot)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = moths$anst, statistic = boot_mean, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 2.470208 -0.003769333   0.7236465

The output is interpreted as follows:

  • original is the original mean of the whole sample: mean(moths$anst).
  • bias is the difference between the original mean and the mean of the bootstrapped samples.
  • std.error is the standard deviation of the simulated values.

You can check which other attributes are available for retrieval (using the dollar sign) with str():

str(myboot)
## List of 11
##  $ t0       : num 2.47
##  $ t        : num [1:10000, 1] 3.2 2.47 3.71 2.94 2.42 ...
##  $ R        : num 10000
##  $ data     : num [1:24] 1 1 0 5 0 ...
##  $ seed     : int [1:626] 10403 523 -1028264899 -1446633906 570667598 -1007258268 94510828 1194839117 1675950064 -505447064 ...
##  $ statistic:function (x, i)  
##   ..- attr(*, "srcref")= 'srcref' int [1:8] 1 13 4 1 13 1 1 4
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x00000294bbfa2a38> 
##  $ sim      : chr "ordinary"
##  $ call     : language boot(data = moths$anst, statistic = boot_mean, R = 10000)
##  $ stype    : chr "i"
##  $ strata   : num [1:24] 1 1 1 1 1 1 1 1 1 1 ...
##  $ weights  : num [1:24] 0.0417 0.0417 0.0417 0.0417 0.0417 ...
##  - attr(*, "class")= chr "boot"
##  - attr(*, "boot_type")= chr "boot"
mean(moths$anst)
myboot$t0
mean(myboot$t) - myboot$t0
sd(myboot$t)
## [1] 2.470208
## [1] 2.470208
## [1] -0.003769333
## [1] 0.7236465

Lastly, we can extract our bootstrap confidence interval as follows:

quantile(
  myboot$t,
  c(0.025, 0.975))
##     2.5%    97.5% 
## 1.204500 4.005612
  • Note that you’ll get slightly different values each time since the process involves random numbers

Boot CI Comparison

How do our intervals compare to each other?

technique            mean     CI radius   lower    upper
parametric: t-dist   2.4702   1.5344      0.9358   4.0046
simple boot          2.4634   NA          1.2083   4.0117
boot package         2.4697   NA          1.2167   4.0620

Graphically (comparison figure not shown here), and as a data frame:

##            technique     mean ci_radius     lower    upper
## 1 parametric: t-dist 2.470208  1.534403 0.9358057 4.004611
## 2        simple boot 2.470567        NA 1.2110563 4.005499
## 3       boot package 2.466439        NA 1.2045000 4.005612

Rarefaction Curve

The principle behind rarefaction is that the number of species detected is related to sampling intensity:

  • The greater the sampling intensity (e.g., number of sample plots or sampling area) the greater the species richness.

This is another way to think of the well-known species-area relationship, in which area is replaced by sampling effort.

A rarefaction curve shows the expected species richness as a function of sampling intensity.

For example, a rarefaction curve that I computed for the number of the 10 rare moth species detected, as a function of the number of sampling sites, looked like this:

Note how jagged it looks. I could probably make it smoother with more replicates. For this plot I only used 10, but you’ll use a much larger number below.

We can use the bootstrap to calculate not only the expected number of species (species richness), but also a confidence envelope.

Setting up the bootstrap

Let’s begin with our moth data set by first removing the first column, which contains an arbitrary site id that is not useful here.

Next, let’s create some objects to hold some quantities and a data matrix to make the subsequent script more compact:

moth_dat = moths[,-1]
head(moth_dat)
##   itame mepi cime  anst cahe  bare   acal   apde psca euma
## 1     0    0    0 1.000  1.0 0.000  0.667  2.500  0.0 0.00
## 2     0    0    0 1.000  0.0 0.000  0.000  0.000  1.5 0.00
## 3     0    2    0 0.000  2.5 0.000  0.667 10.000  0.0 0.00
## 4     0    0    3 5.000  0.0 1.667  0.500  2.333  1.5 0.00
## 5     0    0    0 0.000  4.5 0.000  0.667  4.000  3.0 0.75
## 6     5    0    0 6.333  1.0 1.667 16.667  1.333  7.0 2.75
n = nrow(moth_dat) #number of rows or sample observations
m = 100 #number of bootstrap iterations
moth_result = matrix(
  nrow = m,
  ncol = n)

We can organize our output as follows:

  • Each iteration of the bootstrap produces one row of output
  • The output columns correspond to sampling intensity.

In this case, we will need a row for each bootstrap iteration and a column for each sampling intensity – which can range from a single observation to the full data set (n):

Next, we can use a few tricks to complete the analysis

First, we need to draw a bootstrap sample from the data set of a specified size.

We can do this using indexing and sample() for the row index of our data frame:

data[sample(...), ]

This says to select the rows from data corresponding to the result of the sample() function.

Inside sample() we need to specify a vector of numbers ranging from 1 to n, the size of the bootstrap sample (i.e., the number of row observations to take), and replace = TRUE to ensure sampling with replacement.

Once we have drawn a bootstrap sample, we compute the species richness of the sample.

We can do this using the apply() function to sum by column (species) and then count how many species have sums > 0 (indicating presence).
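
Before writing the full double loop, it can help to run the trick once by hand; for example, one bootstrap sample of 5 rows and its species richness (the object names here are just for the demo):

# Draw one bootstrap sample of 5 row indices, with replacement
rows_demo = sample(nrow(moth_dat), size = 5, replace = TRUE)
demo_sample = moth_dat[rows_demo, ]

# Sum each species column, then count the species that were detected at all
demo_sums = apply(demo_sample, 2, sum)
sum(demo_sums > 0)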

We need to store the result in the appropriate location in the result matrix we created above.

Finally, we need to put this whole set of operations into a double loop.

The inside loop will create bootstrap samples of size 1 to n; the outside loop will run the m bootstrap iterations (100 in this demonstration; you’ll use a larger number for the lab questions).

Running the bootstrap simulation

This is what it looks like all together:

n = nrow(moth_dat) #number of rows or sample observations

m = 100 #number of bootstrap iterations

moth_result = matrix(
  nrow = m,
  ncol = n)


# The outer loop: runs once for each bootstrap iteration.  index variable is i
for(i in 1:m)
{
  # The inner loop: simulates increasing sampling intensity
  # Sampling intensity ranges from 1 site to the complete count of sites (24)
  # index variable is j
  for(j in 1:n)
  {
    # sample the input data row indices, with replacement
    rows_j = sample(n, size = j, replace=TRUE)

    # Creates a new data matrix from the resampled rows.
    t1 = moth_dat[rows_j, ]

    # Calculates the column sums of the new data matrix.
    t2 = apply(t1, 2, sum)

    # Counts the number of columns in which any moths were observed
    moth_result[i, j] = sum(t2 > 0)
  }
}

head(moth_result)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## [1,]    4    6    7    8   10   10   10   10   10    10    10    10    10    10    10    10    10
## [2,]    8   10    8    5    9    8    9   10   10    10    10    10    10    10    10     9    10
## [3,]    2    7    6    8   10   10    9    9   10     9     9    10    10     9    10    10    10
## [4,]    2    5   10    8    8   10    9   10   10    10    10    10    10    10    10    10     9
## [5,]    3    6   10    9    8    9    8    9   10     8    10    10    10    10    10    10    10
## [6,]    2    6   10    9    9   10    8   10   10    10    10    10     9    10     9    10    10
##      [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    10    10    10    10    10    10    10
## [2,]    10    10    10    10    10    10    10
## [3,]    10    10    10    10    10    10    10
## [4,]    10    10    10    10    10    10    10
## [5,]    10    10    10    10    10    10    10
## [6,]    10    10    10    10    10    10    10

Some things to note:

  • Each row in the result matrix corresponds to a single simulation (the outer loop).
  • Each column represents a different level of sampling effort (the inner loop).
  • The values in each cell are the number of species observed.

Packaging your code into a function

Whenever you find yourself writing complex loops or other blocks of code, you should consider whether you might want to re-use it elsewhere.

Wrapping up code into a function is a great way to facilitate re-use and tinkering.

We took some care to build the loop, so let’s take just a few more minutes to make a portable version. I should ask myself:

  • What do I want my function to do?
    1. Execute the double loop.
    2. Return the results in a matrix.
  • What input does my function need to know about?
    1. The data to resample: a data.frame.
    2. The number of bootstrap iterations

First Draft

In my first draft, I’ll just copy all of the code inside the function and propose argument names based on the function inputs I identified above:

rarefaction_sampler = function(input_dat, n_iterations)
{
  n = nrow(moth_dat) #number of rows or sample observations
  m = 100 #number of bootstrap iterations

  moth_result = matrix(
    nrow = m,
    ncol = n)

  # The outer loop: runs once for each bootstrap iteration.  index variable is i
  for(i in 1:m)
  {
    # The inner loop: simulates increasing sampling intensity
    # Sampling intensity ranges from 1 site to the complete count of sites (24)
    # index variable is j
    for(j in 1:n)
    {

      # sample the input data row indices, with replacement
      rows_j = sample(n, size = j, replace=TRUE)

      # Creates a new data matrix
      t1 = moth_dat[rows_j, ]

      # Calculates the column sums
      t2 = apply(t1, 2, sum)

      # Counts the number of columns in which any moths were observed
      moth_result[i, j] = sum(t2 > 0)
    }
  }

  return(moth_result)
}

rarefact = rarefaction_sampler(moth_dat, 100)
head(rarefact)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## [1,]    4    6    8   10    8   10   10    9   10     9    10     9    10     9    10    10    10
## [2,]    2    7    8    7    8    9   10    8   10    10    10    10    10    10    10    10    10
## [3,]    4    8    7    9    8   10   10   10    8     9    10    10    10    10    10    10    10
## [4,]    3    8    9    8    9    8   10    9    9     9    10    10    10     9    10    10    10
## [5,]    4    4    7    8    9    8   10   10   10     9    10    10    10    10    10    10    10
## [6,]    8    7    6    8   10    8    9    9   10     9    10    10    10    10    10    10    10
##      [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    10    10    10    10    10    10    10
## [2,]    10    10    10    10    10    10    10
## [3,]    10    10    10    10    10    10    10
## [4,]    10    10    10    10    10    10    10
## [5,]    10    10    10    10    10    10    10
## [6,]    10    10    10    10    10    10    10

Second Draft

That runs, but I know I’m not done because I haven’t changed any code in the function body.

In the second draft, I’ll check that the body of my function does not make reference to any variables that are not defined either by the arguments or within the function body.

  • I know that m was defined already outside of the function, so I want to replace it with n_iterations.
  • I know that n was defined as the number of rows in moth_dat.
  • I want my function to work for any input dataframe, so I’ll replace it with a new variable, n_input_rows, that I define within the function body.
  • I’ll get rid of moth_result, and instead create a new output matrix called results_out
rarefaction_sampler = function(input_dat, n_iterations)
{
  n_input_rows = nrow(input_dat)

  results_out = matrix(
    nrow = n_iterations,
    ncol = n_input_rows)

  # The outer loop: runs once for each bootstrap iteration.  index variable is i
  for(i in 1:n_iterations)
  {
    # The inner loop: simulates increasing sampling intensity
    # Sampling intensity ranges from 1 site to the complete count of
    # sites in the input data (n)
    # index variable is j
    for(j in 1:n)
    {
      # sample the input data row indices, with replacement
      rows_j = sample(n, size = j, replace=TRUE)

      # Creates a new data matrix
      t1 = input_dat[rows_j, ]

      # Calculates the column sums
      t2 = apply(t1, 2, sum)

      # Counts the number of columns in which any moths were observed
      results_out[i, j] = sum(t2 > 0)
    }
  }
  return(results_out)
}

rarefact = rarefaction_sampler(moth_dat, 100)
head(rarefact)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## [1,]    4    6    9   10   10    9   10    9    9     9    10    10    10    10    10    10    10
## [2,]    6    8   10    8    8    9    7    9   10    10     9     9    10    10    10    10    10
## [3,]    3    5    6    8    9    8   10   10   10     9    10    10    10    10    10    10     9
## [4,]    5    8    8    9   10   10   10   10   10    10     9     9     9    10    10    10    10
## [5,]    3    9    9    7   10   10   10   10   10     9     9    10    10    10    10    10    10
## [6,]    4    7    9   10    8   10   10    9    9    10    10    10    10    10     9    10    10
##      [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    10    10    10    10    10    10    10
## [2,]    10    10    10    10    10    10    10
## [3,]    10    10    10    10    10    10    10
## [4,]    10    10    10    10    10    10    10
## [5,]    10    10    10    10    10    10    10
## [6,]    10     9    10    10    10     8    10

It runs!

But just to be sure, I’m going to run my new function in a fresh R environment:

Check in a fresh environment

# This clears the current R session's environment
rm(list = ls())

# Re-read my data:
moths = read.csv(here("data", "moths.csv"))
moth_dat = moths[,-1]

rarefaction_sampler = function(input_dat, n_iterations)
{
  n_input_rows = nrow(input_dat)

  results_out = matrix(
    nrow = n_iterations,
    ncol = n_input_rows)

  # The outer loop: runs once for each bootstrap iteration.  index variable is i
  for(i in 1:n_iterations)
  {
    # The inner loop: simulates increasing sampling intensity
    # Sampling intensity ranges from 1 site to the complete count of
    # sites in the input data (n)
    for(j in 1:n)
    {
      # sample the input data row indices, with replacement
      rows_j = sample(n, size = j, replace=TRUE)

      # Creates a new data matrix
      t1 = input_dat[rows_j, ]

      # Calculates the column sums
      t2 = apply(t1, 2, sum)

      # Counts the number of columns in which any moths were observed
      results_out[i, j] = sum(t2 > 0)
    }
  }
  return(results_out)
}

rarefact = rarefaction_sampler(moth_dat, 100)
## Error in rarefaction_sampler(moth_dat, 100): object 'n' not found
head(rarefact)
## Error in head(rarefact): object 'rarefact' not found

What happened?

  • I got the error message: “Error in rarefaction_sampler(moth_dat, 100) : object ‘n’ not found”
  • How should I proceed?

I’ll let you find the error and fix it for the first lab exercise question!

Debugging template

You can test your corrected function using the following template:

# This clears the current R session's environment
rm(list = ls())

# Re-read my data:
moths = read.csv(here("data", "moths.csv"))

rarefaction_sampler = function(input_dat, n_iterations)
{
  ... paste your corrected code here ...
}

rarefact = rarefaction_sampler(moths[,-1], 100)
head(rarefact)

Building the Rarefaction Curve

You are now ready to continue with the rarefaction.

Run the simulator with 10000 iterations:

# Re-read my data:
moths = read.csv(here("data", "moths.csv"))
rarefact = rarefaction_sampler(moths[,-1], 10000)
##      X6 X8 X6.1 X10 X10.1 X9 X10.2 X10.3 X10.4 X9.1 X10.5 X10.6 X10.7 X10.8 X9.2 X10.9 X10.10
## [1,]  4  8    8   9    10  9     9    10     8   10     9    10    10    10   10    10     10
## [2,]  2  4    8   8     9 10    10     9     9   10    10    10    10    10   10    10     10
## [3,]  3  9    6   7     9 10     8     9     9   10    10    10    10    10   10    10     10
## [4,]  4  7    8   9     9  9    10    10    10    9     9    10    10    10   10    10     10
## [5,]  3  2   10   5     9 10     9     9    10    9    10    10    10    10   10    10     10
## [6,]  3  8    8   8     9  8    10    10     8   10    10    10    10    10   10    10     10
##      X10.11 X10.12 X10.13 X10.14 X10.15 X10.16 X10.17
## [1,]     10     10     10     10     10     10     10
## [2,]     10     10     10      9     10     10     10
## [3,]     10     10     10     10     10     10     10
## [4,]     10     10     10     10     10     10     10
## [5,]     10     10     10     10     10     10     10
## [6,]     10     10     10     10     10     10     10

It may take several minutes to run with 10,000 iterations, so be patient.

When it is finished, the object rarefact now contains 10,000 rows and 24 columns, where each row is a separate bootstrap iteration and each column represents the size of the bootstrap sample.

In this case the column number represents the number of sample observations or sampling intensity ranging from 1 to 24.

We can calculate the mean and the 2.5% and 97.5% quantiles of the bootstrapped species richness for each sampling intensity using the apply() function, as follows:

For convenience, let’s bind the objects together and transpose the data frame so that the columns represent the mean, 2.5% and 97.5% quantiles, and the rows represent sampling intensity ranging from 1 to 24.

rare_mean = apply(rarefact, 2, mean)
rare_quant = apply(rarefact, 2, quantile, probs=c(0.025, 0.975))
rare = t(rbind(rare_mean, rare_quant))

Plotting the curve

We can plot the rarefaction curve and the 95% confidence interval using the matplot() function, which is useful for simultaneous plotting of several columns of a data frame or matrix.

We can also add a legend to make the plot complete, as follows:

matplot(
  rare,
  type='l',
  xlab='Number of sampling plots',
  ylab='Species richness',
  main="Mike's Awesome Rarefaction Curve")

legend(
  'bottomright',
  legend=c('mean','2.5%','97.5%'),
  lty=c(1,2,3),col=c(1,2,3), inset=c(.1,.1))

What does the rarefaction curve indicate about the species-sampling intensity relationship?

Do you think 5 sample plots are sufficient to get a reliable estimate of the number of species present in the study area?

What about 10 or 15? If you sampled 10 plots, what is your estimate of the number of species present and what is your confidence in that estimate?

Lab Questions

Penguin Parametric CI Q1-5

Review the bootstrap and parametric confidence interval materials in the lab walkthrough.

Calculate a parametric 95% CI for mean bill length (in mm) for the Gentoo penguins in the penguins dataset from package palmerpenguins using your SSE function. For this calculation you should use Student’s t-distribution to calculate the critical values.

  • Q1 (1 pt.): What is the sample size, n? Show the code you used for the calculation and remember to check for missing data.
  • Q2 (1 pt.): What is the sample standard deviation? Show the code you used for the calculation.
  • Q3 (2 pts.): What are the critical t-values? Show the R code you used for the calculation. Make sure you include both values.
  • Q4 (1 pt.): What is the sample standard error? Show the R code you used for the calculation.
  • Q5 (2 pts.): Finally, construct the CI and show the R code you used for the calculation.

Penguin Bootstrap CI Q6-8

Review the bootstrap confidence interval materials in the lab walkthrough.

  • Calculate a bootstrap 95% CI for mean bill length (in mm) for the Gentoo penguins in the penguins dataset from package palmerpenguins.
  • Use the boot() function from package boot
  • Q6 (1 pt.): What is the CI?
  • Q7 (1 pt.): Show the R code you used to call the boot() function.
  • Q8 (2 pts.): Show the R code you used to calculate the upper and lower 2.5% quantiles.

Rarefaction Sampler Q9-13

You’ll be using your rarefaction_sampler() function together with the rare moth abundance data.

Review the rarefaction curve materials in the lab walkthrough and complete the rarefaction_sampler() function code.

Be sure to test out your completed function using the debugging template to make sure it works correctly.

Calculate a rarefaction curve using 10000 replicate simulations.

Create a plot of your moth sampling rarefaction curve using the code in the lab walkthrough as a template. Include a 95% confidence envelope, i.e., the upper 97.5% and lower 2.5% quantiles.

  • Note: my example plot has a very generic title and the legend might not be very informative to somebody who wasn’t already familiar with rarefaction curves and confidence envelopes!
  • You should improve upon my plot. Consider updating features such as the:
    • Legend text, position, size, etc.
    • Line colors, width, and type
    • Plot annotations (title, axes, etc.)
  • Optional challenge question: can you shade the area of the plot inside the confidence envelope using a light gray (or other light color)?

  • Q9 (5 pts.): Show your completed rarefaction_sampler() function.
  • Q10 (1 pt.): What did you find most difficult about building the function?
  • Q11 (4 pts.): Show the code you used to perform the simulations and construct the curve.
  • Q12 (4 pts.): Include your rarefaction curve plot in your report.
  • Q13 (2 pts.): About how many sites should you visit if you want to see all of the moth species? Explain your reasoning using your rarefaction curve figure.