In our scenario, we want to test the grazing preferences for cows in two different pastures, and with three grass height varieties. Embedded into our scenario are several questions of interest:
In ANOVA terms, we realize that our data has information about two factors: the grass height and the pasture.
In ANOVA terms we can re-formulate the questions as:
The first two questions are straightforward, but the third is trickier.
Fortunately we can answer these questions with 2-way interactive ANOVA.
This code will create the grazing data for the examples below:
There are three grass varieties: short, medium, and tall
grass_levels = c("short", "med", "tall")
There are two pastures: upper and lower
pasture_levels = c("upper", "lower")
This is a shortcut to build a matrix of factor levels for a balanced anova design:
expand.grid(
replicate = 1:3,
grass = grass_levels,
pasture = pasture_levels)
replicate grass pasture
1 1 short upper
2 2 short upper
3 3 short upper
4 1 med upper
5 2 med upper
6 3 med upper
7 1 tall upper
8 2 tall upper
9 3 tall upper
10 1 short lower
11 2 short lower
12 3 short lower
13 1 med lower
14 2 med lower
15 3 med lower
16 1 tall lower
17 2 tall lower
18 3 tall lower
The grazer abundance data are:
abundance = c(9,11,6,14,17,19,28,31,32,7,6,5,14,17,15,44,38,37)
Finally we can combine everything into a single data frame:
grazing_data =
data.frame(
abundance,
expand.grid(
replicate = 1:3,
grass = grass_levels,
pasture = pasture_levels))
The complete code to build the grazing data
grass_levels = c("short", "med", "tall")
pasture_levels = c("upper", "lower")
expand.grid(
replicate = 1:3,
grass = grass_levels,
pasture = pasture_levels)
replicate grass pasture
1 1 short upper
2 2 short upper
3 3 short upper
4 1 med upper
5 2 med upper
6 3 med upper
7 1 tall upper
8 2 tall upper
9 3 tall upper
10 1 short lower
11 2 short lower
12 3 short lower
13 1 med lower
14 2 med lower
15 3 med lower
16 1 tall lower
17 2 tall lower
18 3 tall lower
abundance = c(9,11,6,14,17,19,28,31,32,7,6,5,14,17,15,44,38,37)
grazing_data =
data.frame(
abundance,
expand.grid(
replicate = 1:3,
grass = grass_levels,
pasture = pasture_levels))
head(grazing_data)
abundance replicate grass pasture
1 9 1 short upper
2 11 2 short upper
3 6 3 short upper
4 14 1 med upper
5 17 2 med upper
6 19 3 med upper
We know that our data contain two predictor variables:
We like to use Analysis of Variance with data sets that have a factorial design, preferably a balanced factorial design.
We call each possible factor level combination a cell.
R has a convenient command we can use to check on the factorial structure:
xtabs(
~grass + pasture,
data = grazing_data)
grass | upper | lower | |
---|---|---|---|
short | short | 3 | 3 |
med | med | 3 | 3 |
tall | tall | 3 | 3 |
It looks like we have some observations in each of the cells, and that the numbers of observations are the same in each cell.
We have a balanced, fully-crossed factorial sampling design (the best possible kind)!
To clarify what we mean by factor level combinations and cells, let’s look at a few examples in our data.
We have some observations of cows in the short grass region in the lower pasture. This cell combines the short level of the grass factor and the lower level of the pasture factor.
How would you describe the cell (in regular-person English) that contained data from the medium level of the grass and the upper level of the pasture factor?
Do the cows prefer the upper or lower pasture?
xtabs(abundance ~ pasture, data = grazing_data)
pasture | Freq |
---|---|
upper | 167 |
lower | 183 |
It looks like they spend more time in the lower pasture, but the difference doesn’t seem very large. We’ll need to perform a statistical test to quantify our evidence.
Do cows like all of the grasses?
xtabs(abundance ~ grass, data = grazing_data)
grass | Freq |
---|---|
short | 44 |
med | 96 |
tall | 210 |
They seem to spend more time in the tall grass, but we need to perform a statistical test to quantify our evidence.
grass | upper | lower |
---|---|---|
short | 26 | 18 |
med | 50 | 46 |
tall | 91 | 119 |
NOTE: since we have a balanced design we can compare the counts in the cells directly.
grass | upper | lower |
---|---|---|
short | 8.67 | 6.00 |
med | 16.67 | 15.33 |
tall | 30.33 | 39.67 |
If we had an unbalanced design, the table of means would allow us to compare cells directly, while the table of sums would be misleading. Why?
These are very basic boxplots. You would want to customize them for inclusion in assignments or publications.
par(mfrow = c(1, 2))
boxplot(
abundance ~ grass,
data = grazing_data)
boxplot(
abundance ~ pasture,
data = grazing_data)
boxplot(
abundance ~ grass + pasture,
data = grazing_data)
Ggplot makes nicer looking boxplots with much shorter code than base R. The syntax is very different from the code we’ve used in class.
I’m providing these as examples (with a template below), but we won’t go over details of ggplot syntax in class.
Part of the ggplot philosophy is that plots can be built up from individual components. You can combine these individual components to create a final product.
We’ll be using the grazing data for all of our plots. For each boxplot, the abundance
column will be displayed on the y-axis. We can create a ggplot
object that keeps track of the data frame and the response variable:
gg_graze = ggplot(grazing_data, aes(y = abundance))
We can create a theme
object to apply to all of the final graphs. If we wanted to, we could change our theme, and the easily apply the changes to each plot we create. For now, we’ll jsut specify that the legend will go on the bottom:
t1 = theme(legend.position = "bottom")
To make the boxplots, we can use geom_boxplot()
to extend our basic ggplot object.
aes()
to specify the grouping factor using the x argument:box_grass =
gg_graze +
geom_boxplot(aes(x = grass))
box_past =
gg_graze +
geom_boxplot(aes(x = pasture))
To actually display the plot, simply call the object you created:
box_grass
grass
are separated along the x-axis.par(mfrow = c(1, 2))
doesn’t work with ggplot objects.
Instead we can use the plot_grid()
function from package cowplot
.
plot_grid(box_grass, box_past, nrow = 1)
First create the linear model objects:
fit_pasture = lm(
abundance ~ pasture,
data = grazing_data)
fit_grass = lm(
abundance ~ grass,
data = grazing_data)
fit_additive = lm(
abundance ~ pasture + grass,
data = grazing_data)
fit_interactive = lm(
abundance ~ pasture * grass,
data = grazing_data)
Abundance modeled by pasture:
Analysis of Variance Table
Response: abundance
Df Sum Sq Mean Sq F value Pr(>F)
pasture 1 14.22 14.222 0.0874 0.7713
Residuals 16 2602.22 162.639
Abundance modeled by grass:
Analysis of Variance Table
Response: abundance
Df Sum Sq Mean Sq F value Pr(>F)
grass 2 2403.11 1201.56 84.484 6.841e-09 ***
Residuals 15 213.33 14.22
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model, without an interaction
Analysis of Variance Table
Response: abundance
Df Sum Sq Mean Sq F value Pr(>F)
pasture 1 14.22 14.22 1.000 0.3343
grass 2 2403.11 1201.56 84.484 1.536e-08 ***
Residuals 14 199.11 14.22
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model, with an interaction
Analysis of Variance Table
Response: abundance
Df Sum Sq Mean Sq F value Pr(>F)
pasture 1 14.22 14.22 2.4615 0.142644
grass 2 2403.11 1201.56 207.9615 4.863e-10 ***
pasture:grass 2 129.78 64.89 11.2308 0.001783 **
Residuals 12 69.33 5.78
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can think of the null hypothesis for an interaction term as meaning that ‘the differences in group means for factor 1 are the same when the data are grouped by factor 2’ and vice versa.
It’s much easier to understand the null and alternative hypotheses graphically:
Interaction plots are similar to the 2-factor boxplots, but they show group means by points and use lines to connect the points in the second group:
Interaction plots are similar to the 2-factor boxplots, but they show cell means by points and use lines to connect the points in the second group:
stat_summary()
will calculate a statistic (such as mean, or standard deviation) then plot it.
Two-way interaction plots perform two groupings. We can vary which group is the color and which is along the x-axis:
gg_graze +
stat_summary(mapping = aes(
x = pasture, colour = grass, group = grass),
fun = mean, geom = "line")
gg_graze +
stat_summary(mapping = aes(
x = grass, colour = pasture, group = pasture),
fun = mean, geom = "line")