Objectives

  • Practice using various models and statistical tests on some data.

Lab Credits

In addition to my original materials, this lab contains material adapted from Kevin McGarigal’s ECO 634 course.

Kevin taught this course for many years and his lab and lecture materials are an invaluable resource!

Catastrophic Rate Data

Recall the salamander catastrophe data we worked with in the Using Models 1 lecture assignment.

catrate = read.csv(here("data", "catrate.csv"))
head(catrate)
##   pond success years  cat.rate
## 1    2       5     7 0.2857143
## 2    3       5     7 0.2857143
## 3    4       6     7 0.1428571
## 4    5       4     7 0.4285714
## 5    6       0     7 1.0000000
## 6    7       1     4 0.7500000

Catrate Data Columns

To review, the data columns represent:

  • pond: the ID of the pond
  • success: the number of years in which successful reproduction occurred at the pond
  • years: the total number of years that the pond was observed
  • cat.rate: the ratio of catastrophes (years without successful reproduction) to total observation years

Comparing to the late fill rate

In the Using Models 1 lecture assignment, we used the t and Wilcoxon tests to test whether the catastrophic rate was equal to the late fill rate.

In those test examples, we treated the data as continuous, but in fact the data are discrete, taking on integer values ranging from 0 to 7.

Binomial Test for Proportions

An alternative test for ratios is the binomial test, which is suitable for data consisting of a set of single trials in which each observation can take on one of two values; e.g., success or fail, present or absent, live or die.

Reproductive Success and Failure

We might ask the question:

  • What is the evidence that reproductive success is more (or less) likely than reproductive failure?

We can investigate the answer with a two-sided binomial test for proportions.

Pooling all of the data in the 14 ponds, we observed a total of 33 reproductive successes out of 61 total trials.

How likely is a response of 33/61 if the reproductive success and failure are equally likely, i.e., \(Pr(success)=0.5\)?

  • We can use a binomial test for this, specifying the number of successes (33) and the total sample size (61), as follows:
n_success = sum(catrate$success)
n_years = sum(catrate$years)
binom.test(
  x = n_success,
  n = n_years,
  p = 0.5)
## 
##  Exact binomial test
## 
## data:  n_success and n_years
## number of successes = 33, number of trials = 61, p-value =
## 0.6089
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4084989 0.6693559
## sample estimates:
## probability of success 
##              0.5409836

Note, we had to first compute the number of successes and the total number of years pooled across all 14 ponds.

Also note the similarity in the syntax to dbinom(), but with slightly different argument names.
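For comparison, here’s the corresponding dbinom() calculation: the probability of observing exactly 33 successes in 61 trials when \(Pr(success)=0.5\). (The binomial test sums tail probabilities rather than using this single value; this is just to show the parallel syntax.)

# probability of exactly 33 successes in 61 trials with p = 0.5
dbinom(x = 33, size = 61, prob = 0.5)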

  • Which argument would we change if we wanted to test whether the catastrophic rate was different than 20%?

Reproductive Catastrophe and Late Filling

Recall that we know that ponds experience late filling in approximately 2 out of every 7 years:

Marbled salamanders lay eggs in dry vernal pool locations in the fall.

Late pond-filling can cause the desiccation and/or freezing of the eggs prior to inundation.


Let’s define variables to hold the late- and normal-filling rates:

late_fill_rate = 2/7
normal_fill_rate = 1 - late_fill_rate

Instead of testing the null hypothesis of equal probability of success and failure (i.e., p=0.5), we might instead ask the question:

  • What is the evidence that reproductive success is more or less frequent than the normal-filling rate?

In this scenario, we expect successful reproduction in approximately 5 of every 7 years.
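To make this concrete, under the normal-filling rate we would expect about 44 successes out of the 61 pooled observation years, noticeably more than the 33 we observed:

# expected number of successes under the null hypothesis: about 43.6
normal_fill_rate * n_years
# observed number of successes
n_success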

We can modify the code we used above to test the observed reproduction success rate against the normal-filling rate:

binom.test(
  x = n_success,
  n = n_years,
  p = normal_fill_rate) 
## 
##  Exact binomial test
## 
## data:  n_success and n_years
## number of successes = 33, number of trials = 61, p-value =
## 0.004227
## alternative hypothesis: true probability of success is not equal to 0.7142857
## 95 percent confidence interval:
##  0.4084989 0.6693559
## sample estimates:
## probability of success 
##              0.5409836

In addition, note again that the default test is a two-sided alternative.

We might instead prefer the one-sided alternative hypothesis that the observed success rate is less than the pond normal-filling rate. We can perform the test as follows:

binom.test(
  x = n_success,
  n = n_years,
  p = normal_fill_rate,
  alternative ='less')
## 
##  Exact binomial test
## 
## data:  n_success and n_years
## number of successes = 33, number of trials = 61, p-value =
## 0.002984
## alternative hypothesis: true probability of success is less than 0.7142857
## 95 percent confidence interval:
##  0.0000000 0.6507238
## sample estimates:
## probability of success 
##              0.5409836

Comparison with one-sample tests

What do these binomial tests say about the success (or catastrophe) rate?

  • Do these results agree with the Student’s t test and/or Wilcoxon’s signed rank test?
  • Which test do you think is most appropriate? Why?

Here’s the t-test syntax. The Wilcoxon test syntax is similar.

t.test(catrate$cat.rate, mu = 2/7)
## 
##  One Sample t-test
## 
## data:  catrate$cat.rate
## t = 2.9595, df = 12, p-value = 0.01193
## alternative hypothesis: true mean is not equal to 0.2857143
## 95 percent confidence interval:
##  0.3526250 0.7261295
## sample estimates:
## mean of x 
## 0.5393773

Two Sample Tests

Now we’ve seen a few one- and two-sample tests in practice. To summarize, we can use classical two-sample tests to:

  • compare two variances
  • compare two sample means or medians
  • check the correlation of two variables
  • compare two (or more) proportions
  • test for independence of two variables in a contingency table

The following sections present examples of each of these tests.

Comparing two variances

Before we can carry out a test to compare two means using classical methods (see below), we need to test whether the sample variances are significantly different.

The simplest test is called Fisher’s F test, based on the F-statistic.

The F-statistic represents the ratio between two variances.

  • The F-statistic follows an F-distribution. This distribution has two parameters:
    • numerator degrees of freedom
    • denominator degrees of freedom
  • We’ll encounter the F-statistic and distribution again when we do Analysis of Variance.
  • The test is based on the idea that if the variances of the two samples are the same, then the ratio of the variances will be close to 1.

In order to be significantly different, the ratio will need to be significantly smaller or larger than 1, depending on whether the smaller variance is in the numerator or denominator.

F-distribution Example: Vegetation Data

These data come from a field experiment designed to assess tree seedling response to understory vegetation treatments.

Specifically, four treatments (including a control) were randomly assigned to 32 plots in a randomized block design.

veg = read.csv(here("data", "vegdata.csv"))
head(veg)
##   block plot date treatment birch pine fern
## 1     A   A3 1995   control     0    4  260
## 2     A   A7 1995   control     0    0  186
## 3     A   A4 1995   removed     8    8   46
## 4     A   A6 1995   removed     6   28    1
## 5     A   A5 1995     mixed     0    1  309
## 6     A   A8 1995     mixed     0    0  258

For the moment, we’ll only consider the pine seedling count (column pine) and we’ll ignore the block experimental design.

Let’s visualize the data using box plots, as follows:

boxplot(pine ~ treatment, data = veg)

It is apparent from the adjacent box plots that the means (and medians) differ among the treatments, but also that the variances differ as well.

Variance test

Let’s test whether the variance in pine seedling count differs between two treatments:

  • control (do nothing)
  • clipped (continuous clipping of fern fronds)

First, let’s subset the data:

veg2 = droplevels(
  subset(
    veg,
    treatment %in% c('control','clipped')
  ))

# ensure treatment is a factor (with unused levels dropped)
veg2$treatment = factor(veg2$treatment)

Now, perform the test:

var.test(
  pine ~ treatment,
  data = veg2)
## 
##  F test to compare two variances
## 
## data:  pine by treatment
## F = 119.3, num df = 7, denom df = 7, p-value = 1.92e-06
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   23.88333 595.86810
## sample estimates:
## ratio of variances 
##           119.2951
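As a quick sanity check, the F statistic is simply the ratio of the two sample variances (with the default alphabetical factor ordering, clipped is the numerator group):

# the F statistic is the ratio of the two sample variances
var_clipped = var(veg2$pine[veg2$treatment == "clipped"])
var_control = var(veg2$pine[veg2$treatment == "control"])
var_clipped / var_control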

What does Fisher’s F test say about equal variances for these two samples?

F-tests Assume Normality

  • Note that Fisher’s F test for unequal variances assumes that the data are normally distributed.

Visual inspection of the box plots suggests that the normality assumption may be reasonable, but we might want to test for normality using one of the tests described previously, for example:

shapiro.test(veg2$pine[veg2$treatment=="control"])
## 
##  Shapiro-Wilk normality test
## 
## data:  veg2$pine[veg2$treatment == "control"]
## W = 0.89397, p-value = 0.2547
shapiro.test(veg2$pine[veg2$treatment=="clipped"])
## 
##  Shapiro-Wilk normality test
## 
## data:  veg2$pine[veg2$treatment == "clipped"]
## W = 0.81806, p-value = 0.04452

Note, because the Shapiro-Wilk test is a one-sample test, we had to select the records for each treatment and conduct separate tests.

Note also the use of logical subsetting!

Non-parametric Variance Test

If the results indicate that the data are non-normal, then we should use a non-parametric test of homogeneity of variances, such as the Fligner-Killeen test, as follows:

fligner.test(
  pine ~ treatment,
  data = veg2)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  pine by treatment
## Fligner-Killeen:med chi-squared = 8.3686, df = 1, p-value =
## 0.003818
  • What does the Fligner-Killeen test say about homogeneity of variances for these two samples?

Thus far, we have been concerned with comparing the variances of two samples.

However, there are roughly equivalent tests for n-sample problems; i.e., when there are more than two groups.

Tests for multiple variances

The n-sample parametric test is called Bartlett’s test, which we can use to test for homogeneity of variances among all four treatment levels as follows:

bartlett.test(pine ~ treatment, data = veg)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  pine by treatment
## Bartlett's K-squared = 63.741, df = 3, p-value = 9.323e-14

What does Bartlett’s test say about homogeneity of variances among all 4 treatments?

  • Note that Bartlett’s test, like Fisher’s F test, is highly sensitive to non-normality and the presence of outliers.

The non-parametric alternative, largely preferred by many statisticians, is the Fligner-Killeen test. We used it to test two variances above, but it can test n variances as well:

fligner.test(pine ~ treatment, data = veg)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  pine by treatment
## Fligner-Killeen:med chi-squared = 19.707, df = 3, p-value =
## 0.0001952
  • Do the results agree with Bartlett’s test?
  • What do you notice about p-values when we use parametric vs. non-parametric tests?
  • Can you see why using parametric tests when the assumptions are not met might lead to incorrect conclusions about significance?

Comparing two sample means

Given what we know about the variation from replicate to replicate within each sample (the within-sample variance), how likely is it that our sample means were drawn from populations with the same mean?

If it is highly likely, then we shall say that our two sample means are not significantly different.

If it is rather unlikely, then we shall say that our sample means are significantly different.

But perhaps a better way to proceed is to work out the probability that the two samples were indeed drawn from populations with the same mean.

We’ve already met two simple tests for comparing two sample means (or medians): the Student’s t test and Wilcoxon’s rank-sum test.

T-test

The Student’s t test is appropriate when the samples are independent, the variances constant, and the errors normally distributed. We implemented it as follows for the pine seedling data in the control and clipped treatment plots:

t.test(
  pine ~ treatment,
  data = veg2)
## 
##  Welch Two Sample t-test
## 
## data:  pine by treatment
## t = 2.2825, df = 7.1173, p-value = 0.05582
## alternative hypothesis: true difference in means between group clipped and group control is not equal to 0
## 95 percent confidence interval:
##  -0.5204871 32.5204871
## sample estimates:
## mean in group clipped mean in group control 
##                17.875                 1.875

The output automatically includes a 95% (by default) confidence interval on the difference between the sample means.

A confidence interval that includes 0 indicates that the sample means are not significantly different.

What does the two-sample t test say about the differences in pine seedling counts between the control and clipped treatments?

Despite the rather large differences in the sample means (17.9 vs 1.9), can we say with confidence that the sample means are different?

T-test assumptions

Of course, the validity of Student’s t test depends on whether the assumptions have been met.

In this case, we know that variances are significantly different between treatments (see above).

In addition, we know that the “clipped” sample is somewhat non-normally distributed.

Hence, there is good reason to be suspicious of the test result.

Wilcoxon test

We’ve used the Wilcoxon test in lecture and lab before.

Wilcoxon’s rank-sum test is appropriate when the samples are independent but the errors are not normally distributed. It is implemented as follows:

wilcox.test(
  pine ~ treatment,
  data = veg2)
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...):
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  pine by treatment
## W = 48, p-value = 0.1005
## alternative hypothesis: true location shift is not equal to 0

Do the results agree with the Student’s t test? Which test result do you have more confidence in, and why?

Tests for paired samples

Sometimes, two-sample data come from paired observations.

In this case, we might expect a correlation between the two measurements, either because they were made on the same individual, or were taken from the same location.

A positive covariance between the two samples reduces the variance of the difference between means, which makes it easier to detect significant differences between the means. Pairing is not always effective, because the correlation between samples may be weak. In general, however, if you can do a paired t test, then you should do the paired test.

A good example comes from a data set of mice whose body mass was measured before and after a treatment.

You can obtain the dataset in the datarium package (you’ll need to install it).

require(datarium)
data("mice2")
head(mice2)
##   id before after
## 1  1  187.2 429.5
## 2  2  194.2 404.4
## 3  3  231.7 405.6
## 4  4  200.5 397.2
## 5  5  201.7 377.9
## 6  6  235.0 445.8

The syntax is slightly different because the data are in “wide” format (not the one-row-per-observation paradigm that we’re used to).
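If you’d like to see the same data in long format, here’s one way to reshape it in base R (just a sketch; the paired test below works directly on the wide columns):

# reshape mice2 to long format: one row per measurement
mice2_long = data.frame(
  id   = rep(mice2$id, 2),
  time = rep(c("before", "after"), each = nrow(mice2)),
  mass = c(mice2$before, mice2$after))
head(mice2_long)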

t.test(mice2$before, mice2$after, paired = TRUE)
## 
##  Paired t-test
## 
## data:  mice2$before and mice2$after
## t = -25.546, df = 9, p-value = 1.039e-09
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -217.1442 -181.8158
## sample estimates:
## mean difference 
##         -199.48
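Incidentally, a paired t-test is equivalent to a one-sample t-test on the within-pair differences. You can verify that this call reproduces the same t, df, and p-value:

# equivalent one-sample test on the paired differences
t.test(mice2$before - mice2$after, mu = 0)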

You should check the normality assumption to verify that a t-test is appropriate:

shapiro.test(mice2$before)
## 
##  Shapiro-Wilk normality test
## 
## data:  mice2$before
## W = 0.92399, p-value = 0.3914
shapiro.test(mice2$after)
## 
##  Shapiro-Wilk normality test
## 
## data:  mice2$after
## W = 0.95283, p-value = 0.7021

According to the Shapiro test, things are good, but the fact that there are only 10 observations in each group might make us a little uncomfortable about assuming normality.

Let’s do a paired Wilcoxon test and compare results. You can figure out the syntax.

## 
##  Wilcoxon signed rank exact test
## 
## data:  mice2$before and mice2$after
## V = 0, p-value = 0.001953
## alternative hypothesis: true location shift is not equal to 0

Compare the p-values of the parametric and nonparametric tests.

  • Which was higher?
  • Did your overall conclusion change?

Now, let’s compare the results of the paired tests to the results of an unpaired test.

t.test(mice2$before, mice2$after, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mice2$before and mice2$after
## t = -17.453, df = 15.667, p-value = 1.099e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -223.7515 -175.2085
## sample estimates:
## mean of x mean of y 
##    200.56    400.04

When in doubt, choose the non-paired option.

We like to use paired tests when they are appropriate, but it’s not always clear when this is so.

Because a paired test makes additional assumptions about the data (i.e. that pairs of observations are meaningfully related), we can run into problems when there is no true paired structure in the data.

If you’re not sure, you should use the unpaired option; it is always a valid choice. Using a paired test on unpaired data doesn’t make sense and can make your analysis invalid.

Correlation

In the paired two-sample problem above, we suggested that a strong correlation between the two measurement variables will affect the test for significant differences between means or medians.

More generally, with any two continuous variables, x and y, the question naturally arises as to whether their values are correlated with each other.

Marbled Salamander

Let’s take the marbled salamander dispersal data introduced in a previous lab.

Recall that the data represent the dispersal of first-time breeders (ftb) from their natal ponds, representing juvenile dispersal, and the dispersal of experienced breeders (eb) from their established breeding ponds, representing adult dispersal.

The data set includes three variables:

  • dist.class: distance class, based on 100 m intervals.
  • disp.rate.ftb: standardized dispersal rate for first-time breeders, which can be interpreted as a relative dispersal probability.
  • disp.rate.eb: standardized dispersal rate for experienced breeders, which can be interpreted as a relative dispersal probability.

The question arises as to whether the dispersal rates for first-time breeders and experienced breeders are correlated.

First, we need to read in the data and check it, as follows:

disp = read.csv(here("data", "dispersal.csv"))
disp
##    dist.class disp.rate.ftb disp.rate.eb
## 1         100          0.00          0.0
## 2         200          0.25          0.8
## 3         300          0.65          0.2
## 4         400          0.92          0.1
## 5         500          0.40          0.3
## 6         600          0.00          0.0
## 7         700            NA           NA
## 8         800          0.20          0.0
## 9         900          0.05          0.0
## 10       1000          0.00          0.0
## 11       1100          0.15          0.0
## 12       1200          0.00          0.0
## 13       1300          0.12          0.0
## 14       1400          0.00          0.0
## 15       1500          0.00          0.0
plot(
  disp.rate.ftb ~ disp.rate.eb,
  data = disp,
  main = "Marbled Salamander Dispersal Rates",
  xlab = "Dispersal Rate\nExperienced Breeders",
  ylab = "Dispersal Rate\nFirst Time Breeders",
  pch = 21, col = 1, bg = "steelblue"
)

Next, we can test the significance of the correlation using the cor.test() function, as follows:

cor.test(
  disp$disp.rate.ftb,
  disp$disp.rate.eb,
  use='complete.obs')
## 
##  Pearson's product-moment correlation
## 
## data:  disp$disp.rate.ftb and disp$disp.rate.eb
## t = 1.2084, df = 12, p-value = 0.2502
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2438293  0.7320180
## sample estimates:
##       cor 
## 0.3293598

Note, we needed to specify the use=’complete.obs’ argument to address the missing values for the 700 m distance class (for which there are no ponds in this particular distance interval).
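Relatedly, if you just want the correlation coefficient itself, the cor() function accepts the same use argument:

# Pearson correlation coefficient, ignoring incomplete pairs
cor(disp$disp.rate.ftb, disp$disp.rate.eb, use = "complete.obs")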

What does the correlation test say about the significance of the correlation between juvenile and adult dispersal-distance functions?

  • The default correlation test statistic is based on Pearson’s product-moment correlation coefficient, r = cor(x, y), which follows a t distribution with length(x) - 2 degrees of freedom if the samples follow independent normal distributions.
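You can verify this connection with a bit of arithmetic, plugging the estimated correlation from the output above into the formula for the t statistic:

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n = 14 complete pairs
r = 0.3293598
n = 14
r * sqrt(n - 2) / sqrt(1 - r^2)  # about 1.2084, matching cor.test()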

If the data are non-normal, then a non-parametric rank-based measure of association is more appropriate.

If method is “kendall” or “spearman”, Kendall’s tau or Spearman’s rho statistic is used to estimate a rank-based measure of association.

These tests may be used if the data do not necessarily come from a bivariate normal distribution.

Let’s try a test of Spearman’s rank correlation:

cor.test(
  disp$disp.rate.ftb,
  disp$disp.rate.eb,
  use='complete.obs',
  method='spearman')
## Warning in cor.test.default(disp$disp.rate.ftb, disp$disp.rate.eb,
## use = "complete.obs", : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  disp$disp.rate.ftb and disp$disp.rate.eb
## S = 102.99, p-value = 0.001169
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7736421

What does this test say about the correlation between dispersal-distance functions? Does it agree with the parametric test? Which do you trust more with this data set?

Comparing two distributions

Another way to compare two samples, whether they are paired or not, is to compare the empirical cumulative distributions of the samples.

This test is known for its wonderful name, the Kolmogorov-Smirnov test, which is an extremely simple test for asking one of two different questions:

  • Are two sample distributions the same, or are they significantly different from one another in one or more (unspecified) ways?
  • Does a particular sample distribution arise from a particular hypothesized theoretical distribution?

The two-sample problem is the one most often used and the one we will concern ourselves with here.

The apparently simple question is actually very broad.

It is obvious that two distributions could be different because their means were different.

  • This was the subject of the Student’s t test and Wilcoxon’s rank sum test above

But two distributions with exactly the same mean could be significantly different if they differed in variance, or in skew or kurtosis, or both.

The Kolmogorov-Smirnov test works on empirical cumulative distribution functions (ecdf).

  • Recall that these give the probability that a randomly selected value of X is less than or equal to x.

Let’s see what the ecdf for the sample of juvenile dispersal rate looks like:

plot(
  ecdf(disp$disp.rate.ftb),
  verticals=TRUE,
  main = "Mike's Plot of Marbled Salamanders\nFirst-Time Breeders: ECDF")

Now let’s add the ecdf for the adult dispersal rate, but change the line type (lty) so that we can distinguish it from the ecdf for the juvenile dispersal rate.

plot(
  ecdf(disp$disp.rate.ftb),
  verticals=TRUE,
    main = "Mike's Plot of Marbled Salamanders\nFirst-Time and Experienced Breeders: ECDF")
plot(
  ecdf(disp$disp.rate.eb),
  verticals=TRUE,
  lty=3,
  add=TRUE)
legend(
  x = 0.4, y = 0.4,
  lty = c(1, 3),
  legend = c("first-time", "experienced"),
  title = "Breeder Class")

Are these two distributions different? We can use the Kolmogorov-Smirnov test (ks.test) to determine if they differ significantly in any aspect.

The test statistic is the maximum difference in value of the cumulative distribution functions; i.e., the maximum vertical difference between the curves for a given value of X.

ks.test(disp$disp.rate.ftb, disp$disp.rate.eb)
## 
##  Exact two-sample Kolmogorov-Smirnov test
## 
## data:  disp$disp.rate.ftb and disp$disp.rate.eb
## D = 0.28571, p-value = 0.4265
## alternative hypothesis: two-sided
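As a sketch of what the test statistic measures, you can recover D by hand from the two ecdfs (assuming the disp data are loaded):

# manual check of the K-S statistic: maximum vertical distance
x = na.omit(disp$disp.rate.ftb)
y = na.omit(disp$disp.rate.eb)
f_x = ecdf(x)
f_y = ecdf(y)
pooled = sort(unique(c(x, y)))
max(abs(f_x(pooled) - f_y(pooled)))  # should match D from ks.test()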

Is there enough evidence to suggest that the dispersal-distance relationship differs between first-time breeders and experienced breeders?

Comparing two or more proportions

Sex-linked killing

Suppose that only 4 female salamanders were killed crossing the road, compared to 16 males.

  • Is this an example of strong sex-linked risk of road mortality?

Before we can judge, of course, we need to know about the number of male and female candidates.

It turns out that 16 males were killed out of 250 male individuals crossing the road, compared to 4 deaths out of only 40 crossings for females.

Now, if anything, it looks like the females did worse than males in successfully crossing the road (10% mortality for females versus 6% for males).

The question arises as to whether the apparent positively biased mortality rate for females is statistically significant, or whether this sort of difference could arise through chance alone.

  • This is a simple binomial proportions test, which we can easily do in R by specifying two vectors:
  1. the number of mortalities for females and males: c(4,16)
  2. the total number of female and male candidates: c(40,250)
prop.test(
  x = c(4,16),
  n = c(40,250))
## Warning in prop.test(x = c(4, 16), n = c(40, 250)): Chi-squared
## approximation may be incorrect
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(4, 16) out of c(40, 250)
## X-squared = 0.24825, df = 1, p-value = 0.6183
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.07629446  0.14829446
## sample estimates:
## prop 1 prop 2 
##  0.100  0.064

A significant p-value indicates that the proportions are different between samples; i.e., that the observed proportions were unlikely to have been drawn from the same underlying population.

Conversely, a non-significant p-value means that there is insufficient evidence to reject the null hypothesis, and we would conclude that the proportions are not statistically different.

What does the binomial test indicate?

  • Do females suffer a higher mortality rate crossing the road or is the observed difference merely a sampling artifact?
  • How would the result change (p-value) if the sample size was doubled but the proportions killed remained the same (i.e., 8 out of 80 females and 32 out of 500 males)?

Dependence of variables in a contingency table

A great deal of data in ecology and conservation comes in the form of counts (whole numbers or integers), for example:

  • the number of animals that died
  • the number of branches on a tree
  • the number of below-freezing days, etc.

With count data, the number 0 is often the value of the response variable; in other words, there are often observations that receive a count of 0.

With contingency tables, counts are cross-classified according to one or more categorical contingent variables, where the contingencies are all the events of interest that could possibly happen.

A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample.

In a contingency table, each observation is cross-classified according to each of the categorical contingent variables; i.e., each observation is placed into one bin representing a unique combination of categorical levels of the contingent variables.

Consider a sample of 40 forest stands where a survey determined that barred owls were either ‘present’ or ‘absent’ and where each stand was classified as either ‘young’ (<80 years) or ‘old’ (>80 years).

Thus, each of the 40 observations could be cross-classified into one of the four contingencies: present-old, present-young, absent-old, and absent-young.

We have the following 2x2 contingency table (with expected values shown in parentheses):

            old          young
  present   16 (12.5)     4 (7.5)
  absent     9 (12.5)    11 (7.5)

Contingency: Chi-square test

We would like to know whether the observed counts differ from what we would expect if presence/absence were independent of stand age.

Do you remember how to calculate the expected values in a contingency table? If not, look it up.
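As a refresher, each expected count is that cell’s row total times its column total, divided by the grand total. For example, for the present-old cell:

# expected count = (row total * column total) / grand total
# present row total = 20, old column total = 25, grand total = 40
(20 * 25) / 40  # = 12.5, matching the table above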

It is clear that the observed frequencies and the expected frequencies are different. But in sampling, everything always varies, so this is no surprise. The important question is whether the expected frequencies are significantly different from the observed frequencies.

We can assess the significance of the differences between observed and expected frequencies in a variety of ways. The usual way is with Pearson’s chi-squared test (generalized linear models are an alternative). Do you remember how to calculate the chi-square test statistic? If not, look it up.
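As a reminder, the statistic sums (observed - expected)^2 / expected over all of the cells. A quick sketch with the owl counts:

# chi-square statistic, without Yates' continuity correction
observed = c(16, 4, 9, 11)
expected = c(12.5, 7.5, 12.5, 7.5)
sum((observed - expected)^2 / expected)  # about 5.23
# chisq.test() applies Yates' correction to 2x2 tables by default,
# which is why it reports the smaller value of 3.84 below.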

In R we can compute the chi-squared test as follows:

owls = matrix(c(16, 9, 4, 11), nrow=2)
rownames(owls) = c("present", "absent")
colnames(owls) = c("old", "young")
chisq_owls = chisq.test(owls)
chisq_owls
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  owls
## X-squared = 3.84, df = 1, p-value = 0.05004

The output of the chisq.test() function prints the value of the test statistic (X-squared: 3.84) as well as a p-value (0.05).

What does the result indicate?

  • Are barred owls present more frequently than expected in old forest stands?
  • It’s a little difficult to tell from just the p-value.

Chi-Square Expected and Observed Values

The significant p-value doesn’t tell us whether barred owls are present more or less frequently than expected.

We can compare the observed and expected values of the contingency table for more insight:

The expected values

round(chisq_owls$expected, 1)
##          old young
## present 12.5   7.5
## absent  12.5   7.5

The observed values

chisq_owls$observed
##         old young
## present  16     4
## absent    9    11

Were there more, or fewer, than expected owls observed in the young forests?

Chi-Square Residuals

It can be a pain to visually compare the two contingency tables. We can instead calculate the residuals, i.e., the differences between the observed and expected values:

round(
  chisq_owls$observed - chisq_owls$expected,
  digits = 1)
##          old young
## present  3.5  -3.5
## absent  -3.5   3.5

Now it’s easy to see that there are fewer owls present in the young forests than expected. We can tell because the residual is negative.
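Note that the chisq.test() result also stores standardized (Pearson) residuals, which scale each difference by the square root of its expected count:

# Pearson residuals: (observed - expected) / sqrt(expected)
round(chisq_owls$residuals, 2)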

Fisher’s Exact test

You might recall that Pearson’s chi-squared test requires the expected cell values to be large, generally greater than 4 or 5. As you can see in the table above, we have met this rule of thumb with the barred owl data, so Pearson’s chi-squared test is appropriate.

However, when one or more of the expected frequencies is less than 4 (or 5), then it is wrong to use Pearson’s chi-squared test for your contingency table. This is because small expected values inflate the value of the test statistic, and it can no longer be assumed to follow the chi-square distribution.

In this case, an alternative test called Fisher’s exact test is more appropriate.

We use the function in the same way as before:

fisher.test(owls)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  owls
## p-value = 0.04837
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.00906 26.52895
## sample estimates:
## odds ratio 
##   4.683421

Does the conclusion change? Note that the procedures described in this section can all be used with contingency tables much bigger than 2x2.

Bird habitat data

The final data set represents standardized breeding bird counts (bird) and a variety of habitat variables (hab) across 1046 sample plots in the Oregon Coast Range.

See birdhab.meta.pdf for a complete description of the data sets.

Here, we are interested in knowing whether the presence/absence of brown creepers varies between the interior and edge of forest stands.

Read in the raw data and create a 2x2 contingency table consisting of counts of brown creeper presence (1) versus absence (0) and forest edge (E) versus interior (I):

birds   = read.csv(here("data", "bird.sta.csv"))
hab     = read.csv(here("data", "hab.sta.csv"))
birdhab = merge(
  birds,
  hab, by=c("basin", "sub", "sta"))

# Create a contingency table for edge/interior and brown creeper presence/absence
table(
  birdhab$s.edge,
  birdhab$BRCR > 0)
##    
##     FALSE TRUE
##   E   144   29
##   I   559  314
# set the presence to be in the first column
br_creeper_table = table(
  birdhab$s.edge, 
  birdhab$BRCR > 0)[, 2:1]

br_creeper_table
##    
##     TRUE FALSE
##   E   29   144
##   I  314   559

Make sure you understand the script above:

  • First, we read in the bird and habitat data and merged them into a single file based on the common fields.

  • Then we used the table() function to compute the cross-classified counts.

  • What code converted the Brown Creeper counts to presence/absence?

  • The next step simply switched the order of the columns so that the present counts were in the first column and the absent counts were in the second column, as this is the expected order in some functions (e.g., prop.test()).

Are brown creepers present more or less frequently than expected in forest interiors versus forest edges, and is the difference significant?

Lab Questions

Before you start the questions, I recommend you work your way through this DataCamp tutorial on R Formulas:

Formulas in R Tutorial

Don’t worry if it doesn’t all make sense yet. You can stop at the “How To Inspect Formulas” section.

You’ve already used formulas with the boxplot() and aggregate() functions. You’ll be using them a lot to build models as well, so it’s worth the effort to get familiar with them!

Q1-2: Chi-square Tests 1

  1. Review the walkthrough materials on testing proportions and contingency tables.
  2. Conduct a Chi-square test on the contingency table of Brown Creeper presence/absence in edge and interior habitats.
  • Q1 (1 pt.): State the null hypothesis of the Chi-square test.
    • Make sure you state the null hypothesis in terms of Brown Creeper presence/absence and edge/interior habitats.
  • Q2 (2 pts.): Consider the results of your test and explain whether you think that Brown Creepers show a significant habitat preference.
    • Make sure you use the output of your statistical test to support your answer.

Q3-5 Building models for ANOVA

A quick pre/re-view of building models for ANOVA

In the lecture assignments, we’ll be conducting Analyses of Variances on the penguin data. In these lab questions we’ll practice checking some of the ANOVA assumptions.

We’ll use the lm() function, which stands for “linear model”, to build models that we can use for ANOVA. To build a simple model of flipper length as a function of species, I would use the following syntax:

require(palmerpenguins)
fit_fl_sp = 
  lm(
    formula = flipper_length_mm ~ species,
    data = penguins)
  • Note that I could also add a second predictor to the model by (slightly) modifying the formula. Re-check how to do this in the DataCamp formula tutorial.
  • When you add your second factor below, make sure you use the correct formula operator for crossing (hint: it’s not the plus symbol).
  • Q3 (1 pt.): Show the R-code you can use to create a model fit (call it fit_species) of penguin body mass as predicted by penguin species.
  • Q4 (1 pt.): Show the R-code you can use to create a model fit (call it fit_sex) of penguin body mass as predicted by sex.
  • Q5 (1 pt.): Show the R-code you can use to create a model fit (call it fit_both) of penguin body mass as predicted by species and sex. This should be an interactive model, i.e. it should include a sex and species interaction.

Q6-9 Homogeneity Assumption: Graphical

We know that the Group 1 methods require homogeneity (constant variance). Conditional boxplots can be a great way to visually check for homogeneity!

If the homogeneity assumption holds, we would expect that all of the groups would have similar variability - the boxes should all be about the same width.

As you know, the boxplot() function will accept formula and data arguments, just like you used to build your fits.

Create conditional boxplots corresponding to each of your model fits.

Hint: the first draft of my doubly conditional boxplot of species and sex looked like this:

Note the missing labels for some of the factor combinations on the x-axis.

Here are a couple of hints to overcome the issue.
  • Check out the use of the names argument for a way to create custom names to be plotted under each box.
  • Check out the las argument.
  • Remember that you can use the special character “\n” to split text over several lines. For example, look at how I built the title for this plot (don’t worry about understanding the rnorm(100) ~ rbinom(100, 1, 0.5) part; focus on the main and names argument values):
boxplot(
  rnorm(100) ~ rbinom(100, 1, 0.5), 
  main = "Plot of \n 100 Random Numbers", 
  names = c("group 1", "group 2"))

I was able to make some adjustments for the second draft:

  • Q6 (1 pt.): Include a conditional boxplot corresponding to the grouping structure in your fit_species model.
  • Q7 (1 pt.): Include a conditional boxplot corresponding to the grouping structure in your fit_sex model.
  • Q8 (3 pts.): Include a conditional boxplot corresponding to the grouping structure in your fit_both model.
    • Your group labels must all correspond to the correct box, be visible, and sensible.
  • Q9 (3 pts.): Based on the shapes of the boxes, which of the models (if any) do you think may have problems fulfilling the homogeneity assumption?

Q10-12 Homogeneity Assumption: Bartlett Test 1

Review the walkthrough to review how to test for homogeneity of variances.

  • We should verify that the variances are homogeneous in the groups we specified in our ANOVAs.

Use bartlett.test() to check for homogeneity of the variances in each of your model fits.

  • Q10 (1 pt.): State the null hypothesis of the Bartlett test.
  • Q11 (1 pt.): What was the p-value from the Bartlett test of homogeneity for observations grouped by species?
    • You can round your answer to 4 decimal digits.
  • Q12 (1 pt.): What was the p-value from the Bartlett test of homogeneity for observations grouped by sex?
    • You can round your answer to 4 decimal digits.

Q13-14 Homogeneity Assumption: Bartlett Test 2

Unfortunately, bartlett.test() doesn’t work with the formula syntax you used to make your two-way model or your doubly-conditional boxplot.

You’ll need to use your friend aggregate() to separate the observations by the two factors sex and species.

You need to figure out how to extract separate vectors of body masses for each of the sex/species groups.

Hint: Here’s some code that I used to extract the body mass observations, grouped by island (you’ll use sex and species):

dat_groups = aggregate(
  body_mass_g ~ island,
  data = penguins,
  FUN = c)
str(dat_groups)
## 'data.frame':    3 obs. of  2 variables:
##  $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 1 2 3
##  $ body_mass_g:List of 3
##   ..$ : int  3400 3600 3800 3950 3800 3800 3550 3200 3150 3950 ...
##   ..$ : int  3250 3900 3300 3900 3325 4150 3950 3550 3300 4650 ...
##   ..$ : int  3750 3800 3250 3450 3650 3625 4675 3475 4250 3300 ...

Hint 2:

I can use the $ operator to extract a list containing the body mass observations in each group. The column is named body_mass_g:

dat_groups$body_mass_g
  • Check the help entry for bartlett.test(). Can you use a list as the value for the x argument?
  • Q13 (1 pt.): What was the p-value from the Bartlett test of homogeneity for observations grouped by both factors?
    • You can round your answer to 4 decimal digits.
  • Q14 (3 pts.): Based on the results of the Bartlett tests, do you anticipate any issues with heterogeneity in any of the models?
    • Make sure you justify your response with the results of your tests.

Q15 - Florida Trees

In 2015, street trees in three neighborhoods in Tampa, FL were surveyed for defects and assessed for probability of failure (Nelson et al., 2022).

The failure probability assessment considers individual tree characteristics such as DBH, crown width, and defects (such as decay, root problems, or leaning trunks). The tree assessment method classifies trees into four ascending classes of failure probability, with 1 indicating a low probability and 4 an extreme probability of failure.

In 2017, Hurricane Irma made landfall in Florida causing extensive destruction. Following the hurricane, the trees were revisited to assess storm damage.

You’ll use a modified version of this survey dataset to practice some of the techniques from the lab walkthrough.

Retrieve the trees_FL.csv file and read it into a data.frame object called dat_fl.

  • Q15 (5 pts.): Perform a graphical exploration of the dataset. Create the following plots and include them in your report. You may create separate figures, or combine them into one multi-panel figure.
    • A barplot of counts of trees in each probability of failure class (column ProbabilityofFailure).
    • A barplot of the counts of trees in each of the failure classes (column Failure_Standardized)
    • A histogram of DBH
    • A scatterplot of DBH (x-axis) and tree height (y-axis)

The table() and barplot() functions will be your friends.

Q16-17 Florida Trees: Compare Distributions

Three types of failure were recorded in the survey. A question of interest is whether there are differences in the characteristics of trees with each type of failure.

For example, we can compare the distributions of DBH within each failure category.

The distribution of DBH for the trees with branch failure is clearly different than the distributions for whole-tree failures and intact trees. The ‘none’ and ‘whole tree’ classes may be more similar.

Conduct a Kolmogorov-Smirnov test on the distributions of DBH for intact trees and the trees with whole-tree failures.

  • Q16 (1 pt.): State the null hypothesis for the Kolmogorov-Smirnov test. Your answer should be in terms of the DBH of the two groups of trees.
  • Q17 (1 pt.): What was the p-value of the test? Based on the evidence, do you think the distribution of DBH is the same for the two groups?

HINT: It may be convenient to create subsets of the FL tree data, one each for the none and whole failure types.

Challenge: Try to re-create my density plot.

Q18-20 Florida Trees Correlations

Take a look at your DBH/tree height scatterplot. Do you see a relationship between the two variables?

Let’s do a formal test to support our graphical intuition.

  • Q18 (1 pt.): Qualitatively describe the shape of the relationship between DBH and height. Is it linear? Curved? Monotonic?
  • Q19 (1 pt.): Given your answer to the previous question, which type of correlation coefficient is most appropriate?
  • Q20 (1 pt.): What is the p-value? Do you conclude that the two variables are significantly correlated?

Q21-25 Florida Trees: Risk Rating

One of the goals of a study on the trees was to determine whether the probability of failure ratings were effective. In other words, were trees with a higher probability of failure rating more likely to experience a failure?

We can use a Chi-square test on a contingency table of failure types and failure probability ratings:

                        Probability of Failure
                           1     2     3     4
  Failure Type  branch    97   154   156    31
                none    1239   670   421    53
                whole     71    65    50    32

Let’s simplify our table (and the interpretation of our results) by combining the branch and whole tree failures into a single category.

We can use some slightly ugly R code to accomplish this:

dat_fl$fail = factor(dat_fl$Failure_Standardized != "none")

levels(dat_fl$fail) = c("No Fail", "Fail")

Now our contingency table looks like:

fl_table_2 = table(
  dat_fl$ProbabilityofFailure,
  dat_fl$fail)
fl_table_2
##    
##     No Fail Fail
##   1    1239  168
##   2     670  219
##   3     421  206
##   4      53   63

Conduct a chi-square test on this contingency table and calculate the chi-square residuals.

Hint: Review the chi-square test code in the walkthrough.

  • Q21 (2 pts.): What was the value of the test statistic (X-squared)? What was the corresponding p-value?
  • Q22 (1 pt.): What is the value of the chi-square residual (rounded to the nearest whole number) for the count of failures in probability category 1?
  • Q23 (1 pt.): Were there more, or fewer, tree failures than expected by chance in failure probability category #1?
  • Q24 (1 pt.): Were there more, or fewer, tree failures than expected by chance in failure probability category #4?
  • Q25 (2 pts.): Given your answers to the previous two questions, do you conclude that the probability of failure rating system is effective?