In addition to my original materials, this lab contains material adapted from Kevin McGarigal’s ECO 634 course.
Kevin taught this course for many years and his lab and lecture materials are an invaluable resource!
You’ll need to retrieve the following files:
vegdata.csv
bird.sub.csv
hab.sub.csv
Save them in the data subdirectory of your main course RProject folder.
In the Bootstrap lab (lab 7), you used bootstrap resampling to calculate confidence intervals and create a rarefaction curve.
In this lab, we’ll be using bootstrap resampling to explore sampling distributions of model parameters in the context of one- and two-sample difference of means tests.
Recall that t-tests are useful when we have a categorical predictor with two levels and a continuous response variable.
We’ll use the palmerpenguins dataset again, but this time, let’s remove the Gentoo penguins.
Don’t forget to use droplevels()!
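Here’s a sketch of one way to do it. I’m assuming penguin_dat is the name you’ll use for the subsetted data (that’s the name used in the t.test() call below):

```r
# Sketch, assuming the palmerpenguins package is installed and loaded:
# library(palmerpenguins)
# penguin_dat = droplevels(subset(penguins, species != "Gentoo"))

# The same subset-then-droplevels pattern on a small made-up data frame:
dat = data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  flipper_length_mm = c(181, 217, 195))
dat_2 = subset(dat, species != "Gentoo")
levels(dat_2$species)               # "Gentoo" is still a factor level
levels(droplevels(dat_2)$species)   # now it's gone
```

Note that subset() removes the rows but not the unused factor level; droplevels() cleans up the levels.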
Perform a t-test of the alternative hypothesis that Adelie penguins have shorter flippers than Chinstrap penguins. Is this a one- or two-tailed test?
t.test(flipper_length_mm ~ species, data = penguin_dat, alternative = "less")
##
## Welch Two Sample t-test
##
## data: flipper_length_mm by species
## t = -5.7804, df = 119.68, p-value = 3.025e-08
## alternative hypothesis: true difference in means between group Adelie and group Chinstrap is less than 0
## 95 percent confidence interval:
## -Inf -4.186534
## sample estimates:
## mean in group Adelie mean in group Chinstrap
## 189.9536 195.8235
Take note of the confidence interval, the p-value, and the group means.
In a previous lab you calculated bootstrap confidence intervals for the mean value of one group of observations.
You can also create a bootstrap confidence interval on the difference of two means… That sounds a lot like a t-test!
two.boot()
You could write your own custom function to do a two-sample bootstrap, but you are already a whiz at writing functions, so you can save some time by checking out two.boot() in the package simpleboot. Remember that you’ll have to install simpleboot before you can use any of its functions. I’ll let you figure out the syntax for two.boot() since you’re getting good at R.
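If you’re curious what two.boot() is doing under the hood, here’s the core idea sketched in base R on made-up numbers (not the penguin data):

```r
# Manual two-sample bootstrap of a difference in means (toy data):
set.seed(1)
group_1 = c(181, 186, 195, 193, 190)  # made-up measurements
group_2 = c(192, 196, 201, 198, 195)  # made-up measurements

# Resample each group with replacement and record the difference in means:
diffs = replicate(
  1000,
  mean(sample(group_1, replace = TRUE)) - mean(sample(group_2, replace = TRUE)))

# diffs is analogous to the t component of a two.boot() result:
quantile(diffs, c(0.025, 0.975))
```

two.boot() packages this resampling up for you and returns an object that works with the boot package’s helper functions.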
I saved the output of two.boot() to an object called pen_boot. It’s a complex object (actually a list) with lots of named sub-parts. Recall that we can use str() to examine the structure of an object:
str(pen_boot)
## List of 12
## $ t0 : num -5.87
## $ t : num [1:10000, 1] -5.29 -4.02 -6.03 -6.66 -6.89 ...
## $ R : num 10000
## $ data : int [1:219] 181 186 195 193 190 181 195 193 190 186 ...
## $ seed : int [1:626] 10403 337 -31981972 -306496208 255981706 -339368191 1223042828 1397488521 95684349 556754659 ...
## $ statistic:function (x, idx)
## $ sim : chr "ordinary"
## $ call : language boot(data = c(sample1, sample2), statistic = boot.func, R = R, strata = ind, weights = weights)
## $ stype : chr "i"
## $ strata : num [1:219] 1 1 1 1 1 1 1 1 1 1 ...
## $ weights : num [1:219] 0.00662 0.00662 0.00662 0.00662 0.00662 ...
## $ student : logi FALSE
## - attr(*, "class")= chr "simpleboot"
## - attr(*, "boot_type")= chr "boot"
The bootstrapped differences are stored in the component called t. How can you retrieve them?
When you figure out how to retrieve the differences, you can make a histogram:
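Here’s a hint at the pattern using a mock object (the numbers below are made up; your real values come from the two.boot() output):

```r
# A mock list with the same kind of structure as pen_boot:
mock_boot = list(t0 = -5.87, t = c(-5.29, -4.02, -6.03, -6.66, -6.89))

# The $ operator retrieves the component named t:
diffs = mock_boot$t
hist(diffs, main = "Bootstrapped differences", xlab = "Difference in means")
```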
We’ll examine a dataset from a field experiment designed to assess tree seedling response to understory vegetation treatments.
Four treatments (including a control) were randomly assigned to 32 plots in a randomized block design. We’ll not worry about the randomized block design part of the experiment.
For now, let’s consider just the number of pine seedlings (pine) under the various treatments.
Let’s visualize the data using conditional boxplots:
boxplot(pine ~ treatment, data = veg)
Since we’re focusing on two-sample tests, let’s create a new data frame that contains only the observations that received the “clipped” or “control” treatments.
You can use the subset()
function in conjunction with a
new operator: %in%
:
dat_tree = droplevels(subset(veg, treatment %in% c("control", "clipped")))
Don’t forget to use droplevels()!
Create a new conditional boxplot on the new data to verify that your subsetting worked:
Use table()
to determine how many observations are in
each of the treatments. I’ll let you figure out the syntax. Hint: you’ll
need to tell table()
which column of the data to use.
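The general pattern, demonstrated on a toy factor rather than the lab data:

```r
# table() tallies the number of observations at each factor level:
f = factor(c("control", "clipped", "control", "clipped", "control"))
table(f)
```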
Conduct a Wilcoxon rank sum test on the difference in means between the treatments.
The syntax for wilcox.test()
is nearly identical to
that of t.test()
.
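To illustrate the shared formula interface, here’s a toy demonstration on made-up numbers (not the pine data):

```r
# wilcox.test() accepts the same formula syntax as t.test():
toy = data.frame(
  count = c(1, 3, 2, 8, 9, 7),
  treatment = factor(rep(c("a", "b"), each = 3)))
wilcox.test(count ~ treatment, data = toy)
```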
What was the p-value associated with your test?
Conduct a bootstrap of the tree data. For this demo code, I’m using
the object tree_boot
that I created using
two.boot()
.
What are the endpoints of a 95% CI?
I can use the boot.ci() function to find out!
boot.ci(tree_boot)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = tree_boot)
##
## Intervals :
## Level Normal Basic
## 95% ( 3.23, 28.83 ) ( 2.50, 27.75 )
##
## Level Percentile BCa
## 95% ( 4.25, 29.50 ) ( 4.38, 29.75 )
## Calculations and Intervals on Original Scale
The function returns four different confidence intervals. Consult the help entry for the reference describing the different interval types. The ‘Percentile’ interval is the type of CI we’ve manually calculated for bootstrapped data in the bootstrapping lab.
To prove this to yourself, you can use the quantile()
function and see if the results match:
quantile(tree_boot$t, c(0.025, 0.975))
## 2.5% 97.5%
## 4.25 29.50
Finally, we can examine a histogram of the bootstrapped differences in means:
We can use Monte Carlo randomization to assess the significance of regression parameters.
We’ll estimate the significance of a slope parameter of a simple linear regression.
Although R’s lm()
function produces a fitted model with
parametric significance values for model coefficients, you may want to
use other methods to assess significance. For example, if your data
don’t meet the assumptions of a Frequentist linear regression,
resampling techniques can sidestep some of the assumptions.
We’ll use a standardized version of the Oregon birds data.
Z-standardization is a form of normalization.
When we model data whose values span different orders of magnitude, normalizing ensures that each variable spans approximately the same range.
This is helpful for comparing model coefficients.
The data in the subbasin files (which we are using here) are aggregated by sub-basin from the full bird census count data we have used previously.
There are habitat variables for 30 subbasins.
You’ll need to retrieve the bird.sub.csv
and
hab.sub.csv
files from the course github site (see the Data
Files section above).
See the corresponding metadata file (birdhab.meta.pdf) for more info.
Read the data files and merge into a single
data.frame
:
# I already read my data into dat_bird and dat_habitat:
dat_all = merge(
dat_bird,
dat_habitat,
by = c("basin", "sub"))
head(dat_all[, c("b.sidi", "s.sidi")])
## b.sidi s.sidi
## 1 0.06678912 0.12
## 2 0.06509689 0.34
## 3 0.06092608 0.78
## 4 0.06012721 0.57
## 5 0.04112905 0.84
## 6 0.06086158 0.73
Note that the b.sidi
column values are about 1/10 the
value of the s.sidi
column.
R has functions for centering and standardizing data, but to illustrate the z-standardization process, we’ll do it manually. Don’t worry, it’s not very difficult!
Recall the process: subtract the sample mean from each value, then divide by the sample standard deviation, i.e. \(z = (x - \bar{x}) / s\).
Here’s how to do it for the b.sidi
column. I’ll create a
new column called b.sidi.standardized
:
# Calculate the sample mean and sd:
b_sidi_mean = mean(dat_all$b.sidi, na.rm = TRUE)
b_sidi_sd = sd(dat_all$b.sidi, na.rm = TRUE)
# Use the subset-by-name symbol ($) to create a
# new column of z-standardized values.
dat_all$b.sidi.standardized = (dat_all$b.sidi - b_sidi_mean)/b_sidi_sd
Now, examine the standardized data to see if our standardization worked:
mean(dat_all$b.sidi.standardized)
## [1] 7.166938e-17
sd(dat_all$b.sidi.standardized)
## [1] 1
Note that \(7.2 \times 10^{-17}\) is effectively 0.
Use the same process to standardize the s.sidi column.
In this case, the relational fields that link the bird and habitat data sets are:
- basin (3 unique basins)
- sub (10 unique sub-basins in each basin)
We’ll only be using two of the many variables in the merged data:
- b.sidi: Simpson’s diversity index for bird communities
- s.sidi: Simpson’s diversity index for vegetation cover
Note that the latter index represents the diversity in landscape composition as defined largely by vegetation seral stage (intermediate stages of succession).
Let’s begin by plotting the data:
plot(
b.sidi ~ s.sidi, data = dat_all,
main = "Simpson's diversity indices",
xlab = "Vegetation cover diversity",
ylab = "Bird diversity")
It appears that there is a negative relationship between bird diversity and vegetation diversity; specifically, bird diversity declines as the vegetation diversity increases.
This seems somewhat counter-intuitive, since we usually think of animal diversity increasing with the diversity of habitats (proxied here by vegetation seral stages).
Could this result have been due to chance?
Or is this relationship likely to be real?
We can fit a simple linear regression using a Least Squares criterion.
We can examine the model coefficients with coef()
. We’ll
want to save the slope coefficient for later.
The slope coefficient is labeled s.sidi
since that is
the predictor we specify in the model.
fit_1 = lm(b.sidi ~ s.sidi, data = dat_all)
coef(fit_1)
## (Intercept) s.sidi
## 0.07116980 -0.02437131
slope_observed = coef(fit_1)[2]
You can easily add a regression line from a model fitted by
lm()
to an existing plot using abline()
and
the fitted model object.
plot(
b.sidi ~ s.sidi, data = dat_all,
main = "Simpson's diversity indices",
xlab = "Vegetation cover diversity",
ylab = "Bird diversity")
abline(fit_1)
We would like to know how likely it is for us to observe a slope coefficient this large and negative if in fact there is no real relationship between bird diversity and vegetation diversity.
We can use the t statistic and corresponding p-value computed by the
lm()
function, but this assumes that the errors are
normally distributed about the mean, which may or may not be the case
here.
To be on the safe side, we are going to construct our own null hypothesis test using a Monte Carlo randomization procedure.
First, let’s simplify our data set by extracting just the two variables we need for this exercise:
dat_1 =
subset(
dat_all,
select = c(b.sidi, s.sidi))
Note the use of the select argument!
Our goal is to use resampling techniques, Monte Carlo and bootstrapping, to characterize the null and alternative distributions.
Monte Carlo randomization breaks up the associations by sampling random values from each column, instead of keeping rows intact.
To create a resampled dataset, we can create two vectors of randomly generated row indices. Then we can use these to create two new vectors of bird and vegetation diversity indices.
set.seed(123)
index_1 = sample(nrow(dat_1), replace = TRUE)
index_2 = sample(nrow(dat_1), replace = TRUE)
dat_resampled_i =
data.frame(
b.sidi = dat_1$b.sidi[index_1],
s.sidi = dat_1$s.sidi[index_2]
)
fit_resampled_i = lm(b.sidi ~ s.sidi, data = dat_resampled_i)
slope_resampled_i = coef(fit_resampled_i)[2]
print(slope_resampled_i)
## s.sidi
## 0.006235381
And we can re-create the scatterplot with regression line:
plot(
b.sidi ~ s.sidi, data = dat_resampled_i,
main = "Simpson's diversity indices (MC resampled data)",
xlab = "Vegetation cover diversity",
ylab = "Bird diversity")
abline(fit_resampled_i)
We need to do this many times to see the range of outcomes we expect by chance alone.
We know that, because of sampling error, a single permutation could by chance produce a slope noticeably different from zero.
We can repeat the process many times to estimate the sampling distribution of the null hypothesis.
First we can pre-allocate a vector to hold the results using
numeric()
. Check out the help entry to find out more about
numeric()
.
m = 10000
result_mc = numeric(m)
Now it’s up to you to build a loop that will resample the data, fit a simple linear regression, and extract the slope parameter. Here’s a skeleton:
for(i in 1:m)
{
index_1 = sample(...)
# ... your loop code ...
result_mc[i] = coef(fit_resampled_i)[2]
}
The output of your loop is a collection of regression slope parameters that were calculated on randomized and resampled data.
Plot a histogram of the slope values.
Use abline() with the argument v = slope_observed to draw a vertical line showing the value of the slope from the regression model on the observed data.
Your result should look something like this:
hist(
result_mc,
main = "Mike's Null Distribution of Regression Slope",
xlab = "Slope Parameter")
abline(v = slope_observed, lty = 2, col = "red", lwd = 2)
Just like finding the critical z-value for confidence intervals, we can calculate a critical value for the slope.
Use quantile() to find the 5th percentile of the null distribution of slopes.
In my simulation I got:
quantile(result_mc, c(.05))
## 5%
## -0.01320388
Do you remember what the observed slope of the real data was?
How many slopes from the Monte Carlo randomization were equal to or less than the observed slope?
For my simulation, I found 18 slopes (out of 10000 simulations) generated from the null distribution that were equal to or less than the observed slope.
If we want an exact p-value for this lower one-sided test, we can compute the proportion of the permuted distribution less than or equal to our observed slope value.
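One way to compute such a proportion: a comparison like result_mc <= slope_observed returns a logical vector, and mean() of a logical vector gives the proportion of TRUE values. A toy demonstration of the pattern (made-up numbers):

```r
null_slopes = c(-3, -1, 0, 2, 5)  # made-up stand-in for result_mc
observed = -1                     # made-up stand-in for slope_observed
mean(null_slopes <= observed)     # 2 of 5 values, i.e. 0.4
```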
## [1] 0.0018
Now that we’ve characterized the null distribution, let’s use bootstrapping to characterize the alternative.
We’ll need slightly different code, and the canned bootstrapping methods in the packages we’ve used won’t help us. But don’t worry, you’ve got the R skills to handle this task!
Recall that for bootstrapping we sample entire rows. This means we only need 1 set of resampled row indices.
set.seed(345)
index_1 = sample(nrow(dat_1), replace = TRUE)
dat_boot = dat_1[index_1, ]
head(dat_boot)
## b.sidi s.sidi
## 29 0.08263485 0.00
## 23 0.05705873 0.62
## 19 0.05820778 0.54
## 21 0.07254766 0.41
## 18 0.06365076 0.49
## 28 0.06284046 0.36
Now, let’s build another linear model and compare the slope coefficient to what we calculated with the original data:
fit_bs1 = lm(b.sidi ~ s.sidi, data = dat_boot)
coef(fit_bs1)
## (Intercept) s.sidi
## 0.07489893 -0.03146039
We can compare this with the coefficients of the model of the original data:
| Model | Intercept | Slope |
|---|---|---|
| Original | 0.071 | -0.024 |
| Bootstrap | 0.075 | -0.031 |
They are pretty close!
Now, we need to repeat the process many times and record the slope coefficients.
The code will be very similar to the code you used for your Monte Carlo loop. I’ll let you figure out the code. Hint: you only need to make a few modifications to your Monte Carlo loop.
A histogram of my bootstrapped model slopes looks like:
hist(
result_boot,
main = "Mike's Alternative Distribution of Regression Slope",
xlab = "Slope Parameter")
abline(v = slope_observed, lty = 2, col = "red", lwd = 2)
abline(v = 0, lty = 2, col = 1, lwd = 2)
Note that I added vertical lines at the observed slope value from the original data and at zero.
We can compare histograms of the null and alternative distributions:
It would be nice if we could plot them on the same panel. It’s difficult to overlay two histograms, but we can easily overlay density plots.
Here’s the syntax to make a density plot of the null distribution:
plot(
density(result_mc),
main = "Mike's Null Distribution Density Plot",
xlab = "Slope Coefficient")
See if you can use similar syntax to plot the alternative distribution:
Now for a challenge: let’s plot both on the same panel.
I’ll let you figure out the code, but I’ll give you a few hints.
- Use the lines() function to plot your alternative density curve. The syntax will be similar to your call to plot().
- You may need to adjust the xlim and ylim arguments for plot().
- Check out the legend() function. I used the arguments x = "topright" to position my legend in the upper right and bty = "n" to suppress the drawing of a box around the legend.

My double density plot looks like:
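For reference, the overlay pattern on made-up data (your version will use result_mc and result_boot):

```r
# Overlay two density curves and add a legend (toy data):
set.seed(42)
null_vals = rnorm(100, mean = -1)
alt_vals  = rnorm(100, mean = 1)

plot(density(null_vals), xlim = c(-5, 5), main = "Two density curves")
lines(density(alt_vals), lty = 2)
legend(
  x = "topright",
  legend = c("null", "alternative"),
  lty = c(1, 2),
  bty = "n")
```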
- Use the two.boot() function to calculate 10000 bootstrap replicates of the difference in mean flipper length of Chinstrap and Adelie penguins.
- How do you tell mean() and sd() to exclude NA values?
- Save the output of two.boot() to a variable called pen_boot. This is so that I can replicate your code easily on my machine for grading and assistance.
- Use quantile() with pen_boot to calculate a 95% confidence interval on the difference in mean flipper lengths.
- Examine pen_boot to figure out what component contains the bootstrapped differences.
- How did you use quantile()? Show the R code you used to answer the question.
- Create a distribution function from pen_boot using ecdf(). Save the output of ecdf() to a variable called pen_ecdf.
- Use pen_ecdf() to calculate the empirical probability of observing a mean difference of -4.5 or greater.
- Use pen_ecdf() to calculate the empirical probability of observing the mean difference predicted by the null hypothesis, i.e. 0 or greater.

The ecdf() function is a little different from other functions that we’ve worked with because it returns a new function instead of an R data object.
pen_ecdf() is a lot like pnorm(): it calculates the area under the density curve to the left of x.
We get the area of the blue shaded region with pnorm(-0.5), which turns out to be 0.3085375.
pen_ecdf() is very similar.
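A toy demonstration that ecdf() returns a function (pen_ecdf would be built from the bootstrapped differences instead):

```r
# ecdf() returns a new function rather than a data object:
f = ecdf(c(1, 2, 3, 4))
f(2.5)        # proportion of values <= 2.5, i.e. 0.5
pnorm(-0.5)   # analogous area under the standard normal curve
```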
Consider the null and alternative hypotheses for a two-sample, and two-tailed, test of the difference in mean flipper length between the two penguin species.
- Read your vegetation data into a data.frame called veg.
- Use droplevels and subset to keep just the control and clipped levels of the treatment column.
- To receive full credit, you must use the formula notation in R to conduct the test.
- Use two.boot() to create a bootstrapped data set of the differences in mean pine tree count between the clipped and control treatments. Save the output to a variable called tree_boot.
- Use quantile() to find a 95% CI.
- Standardize the s.sidi and b.sidi columns of dat_all. Add the standardized data to dat_all using the new column names s.sidi.standardized and b.sidi.standardized.
- Use quantile to find the 5% quantile of slopes in the null distribution. This is the critical value.
- Use abline() along with the argument v = to add a vertical line showing where the observed slope occurred.
- Use abline() along with the argument v = to add a vertical line showing the critical value.