In this exercise you’ll get a chance to practice your R-coding skills and build an understanding of the inner workings of one-way Analysis of Variance by using R to do an ANOVA “by hand”!
You will need the data file rope.csv.
The data set comes from a study by Dr. Brian Kane (UMass) in which he was interested in evaluating the differential resistance of various climbing rope types to accidental cutting by different hand-saw blades.
The motivation behind the study was an accident in which a professional tree pruner was cutting through a branch when his blade accidentally struck and severed his climbing rope, causing him to fall to his death.
A civil suit was filed against the rope manufacturer, who promptly asked Dr. Kane to investigate the problem.
Dr. Kane set up the following controlled experiment.
He designed a two-way factorial experiment with two predictor variables: rope type and blade type.
The experiment was a fully crossed design with 5 replicates of each rope type and blade type.
He reproduced the circumstance of the accident with the following setup:
For each trial, he measured several response variables, including:
For the purposes of this exercise, we will create a simple one-way ANOVA with:
Note: You could extend the following procedure to more complex experimental designs such as a two-way ANOVA.
Dr. Kane wants to use the results of this study to help inform the design of additional experiments, perhaps to examine additional rope types, rope tension levels, and/or other rope conditions.
His primary concern is to ensure that he designs an experiment with adequate statistical power to detect an effect if in fact there is one, given that the subsequent experiments might involve slightly different treatments (e.g., additional rope types).
Before we try to build a simulation of statistical power (in the next lab), we’ll familiarize ourselves with the data by performing an Analysis of Variance by hand.
You may use this code template to keep track of your progress.
Fill in the code as you go.
rm(list = ls())
rope = ....
rope$rope.type = factor(....
n_obs = ....
n_groups = ....
ss_tot = ....
df_tot = ....
agg_sum_sq_resids = ....
ss_within = ....
df_within = ....
ss_among = ....
df_among = ....
ms_within = ....
ms_among = ....
f_ratio = ....
f_pval = ....
First, we need to know how many levels there are to our independent treatment factor, rope.type, and the total number of observations (i.e., sample size).

Make sure that rope.type is a factor, and not a string:

- Use factor() to convert a string variable into a factor.
- Use levels() to view the different levels within a factor.

Self test: when you have correctly read in your data and factorized rope.type, you should get this output when you run levels(rope$rope.type):
levels(rope$rope.type)
## [1] "BLAZE" "BS" "PI" "SB" "VEL" "XTC"
You will need to know the total number of observations and the number of groups for the ANOVA.
Recall that when we use a categorical predictor, we can also think of it as a grouping factor. Individual observations that have the same value for the categorical variable are within the same group.
Calculate the total number of observations and save it to a
variable called n_obs
.
Calculate the number of groups (rope types) and save it in a
variable called n_groups
.
n_obs = ....
n_groups = ....
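If you get stuck, here is one possible sketch. The "data/rope.csv" path is an assumption; use whatever path points to your copy of the file:

```r
# Read the data and make sure rope.type is a factor
rope = read.csv("data/rope.csv")   # path is an assumption; adjust as needed
rope$rope.type = factor(rope$rope.type)

n_obs = nrow(rope)                         # total number of observations
n_groups = length(levels(rope$rope.type))  # number of rope types (groups)
```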
Next, we need to partition the total variance in the response variable into its components:
Here, the trick is to realize that the total variance is the sum of the among-group variance and within-group variance, so by calculating any two of these components, we automatically know the third.
We begin by calculating the “sums of squares” for the entire data set, i.e., the sum of the squared deviations of each observation from the overall mean, since this is the easiest calculation.
This is a measure of total variability in the data set, and we often
abbreviate this as ss_tot
:
Calculate the total sum of squares and save it to a
variable called ss_tot
:
ss_tot = ....
What are the degrees of freedom? Remember we lose a degree of freedom for each parameter we estimate.
How many parameters did we have to estimate to get the total sum of squares?
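As a sketch, the total sum of squares and its degrees of freedom could be calculated like this (assuming rope, n_obs, and the response column p.cut from above):

```r
# Sum of squared deviations of each observation from the grand mean
ss_tot = sum((rope$p.cut - mean(rope$p.cut))^2)

# We estimated one parameter (the grand mean), so we lose one d.f.
df_tot = n_obs - 1
```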
Next, we need to calculate the sums of squares within groups (rope types). You can think of this as a refinement of the model we used to calculate ss_tot.

You can gain intuition about ss_tot and ss_within by comparing boxplots of the unconditioned data and the data conditioned on the grouping factor:
The SS within is also called the sum of squares error (SSE), or the sum of squares of residuals.
Calculating the within-group SS isn’t as easy as calculating
ss_tot
.
For each group we must:
We can use aggregate()
to accomplish this for us, but
the implementation will take some finessing. You should read the help
entry for aggregate()
now.
As a first approximation, we can calculate the group means with
aggregate()
:
aggregate(
x = rope$p.cut,
by = list(rope$rope.type),
FUN = mean)
## Group.1 x
## 1 BLAZE 0.3671429
## 2 BS 0.2370000
## 3 PI 0.1870000
## 4 SB 0.2720000
## 5 VEL 0.3500000
## 6 XTC 0.2655000
Notice the argument FUN = mean
. This tells
aggregate()
to use mean()
directly on the
subsets of the data (the groups).
I can be more explicit using this syntax:
aggregate(
x = rope$p.cut,
by = list(rope$rope.type),
FUN = function(x) mean(x))
## Group.1 x
## 1 BLAZE 0.3671429
## 2 BS 0.2370000
## 3 PI 0.1870000
## 4 SB 0.2720000
## 5 VEL 0.3500000
## 6 XTC 0.2655000
This time I used FUN = function(x) mean(x)
.
The syntax should be familiar to you from all of your function-building experience.
How could I change the FUN
argument syntax to calculate
the within-group residuals?
First, try to modify my code to calculate the residuals. Save the
output of your call to aggregate()
to a variable called
agg_resids
.
You can test your work by comparing the structure of your
agg_resids
object to mine:
str(agg_resids)
## 'data.frame': 6 obs. of 2 variables:
## $ Group.1: Factor w/ 6 levels "BLAZE","BS","PI",..: 1 2 3 4 5 6
## $ x :List of 6
## ..$ : num 0.633 0.633 0.623 0.173 0.143 ...
## ..$ : num 0.303 0.223 0.193 0.183 0.093 ...
## ..$ : num 0.363 0.133 0.113 0.103 0.083 0.063 0.053 0.053 0.033 -0.007 ...
## ..$ : num 0.398 0.238 0.178 0.168 0.168 0.138 0.118 0.118 0.048 0.038 ...
## ..$ : num 0.65 0.36 0.3 0.22 0.16 ...
## ..$ : num 0.3545 0.3145 0.2745 0.2545 0.0745 ...
The function you create with the FUN = argument is called an anonymous function: it has no name of its own, and it disappears once aggregate() has finished.

Now modify your aggregate() call to calculate the sums of squared residuals within each group, starting from your agg_resids code. Save the output to a variable called agg_sum_sq_resids.
As a self test, see if your output matches mine:
str(agg_sum_sq_resids)
## 'data.frame': 6 obs. of 2 variables:
## $ Group.1: Factor w/ 6 levels "BLAZE","BS","PI",..: 1 2 3 4 5 6
## $ x : num 1.808 0.405 0.312 0.633 1.129 ...
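If your output doesn't match, here is one possible shape for the two calls; note the anonymous functions passed to FUN (this is a sketch, not the only valid approach):

```r
# Within-group residuals: each observation minus its group mean
agg_resids = aggregate(
  x = rope$p.cut,
  by = list(rope$rope.type),
  FUN = function(x) x - mean(x))

# Within-group sums of squared residuals
agg_sum_sq_resids = aggregate(
  x = rope$p.cut,
  by = list(rope$rope.type),
  FUN = function(x) sum((x - mean(x))^2))
```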
Now you can use the $ subsetting operator to extract the group SSs (and then calculate their sum).
Note: the final sum is the sum of squares error pooled across groups.
Save your within-group sum of squares to a variable called
ss_within
.
Remember the logic for calculating the total degrees of freedom?
How can you calculate the within-group df using similar logic?
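A sketch of both quantities (assuming agg_sum_sq_resids, n_obs, and n_groups from above):

```r
# Pool the per-group sums of squares into the overall within-group SS
ss_within = sum(agg_sum_sq_resids$x)

# One d.f. is lost for each group mean we estimated
df_within = n_obs - n_groups
```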
Now we can compute the remaining variance component, “sums of squares among”, which is often abbreviated SSA:
ss_among = ss_tot - ss_within
The extent to which ss_within
is less than
ss_tot
is a reflection of the magnitude of the differences
between the means.
If ss_within
is much smaller than ss_tot
,
then most of the variation in the response is due to differences among
groups (or levels of the independent factor).
Another way of looking at this is that as the ratio of
ss_among
to ss_within
increases, then an
increasing amount of the variation is due to differences among
groups.
It is tempting to think we are done, but we are not.
We’ve briefly touched upon the concept of normalization; here’s a chance to see it in action! In this case, we can’t compare the sums of squares directly because the number of groups differs from the total number of observations. We know that larger samples will have larger sums of squared residuals, even if the variance is the same. When we normalize a sum of squared deviations from a mean, we end up with our old friend: variance.
We can compare the within-group and among-group variances directly, whereas we cannot directly compare the within-group and among-group sums of squares.
We need to adjust the sums of squares to reflect the degrees of freedom available given the number of treatments (or levels of the independent factor) and the number of replicates per treatment.
You calculated the total number of observations above and saved it as
n_obs
. To calculate variances we have to convert numbers of
observations to degrees of freedom. The total degrees of freedom is just
n_obs - 1
.
We lose 1 d.f. because in calculating ss_tot
we had to
estimate one parameter from the data in advance: the grand mean.
Save this value to a variable called df_tot.

Now we need to know how many degrees of freedom to allocate to
ss_within
.
Fortunately we can use a shortcut.
We know that for each group we lose one degree of freedom because we calculated a group mean.
The within-group degrees of freedom is just the number of observations minus the number of groups!
Save this value to a variable called df_within.

To see why this works, consider how many observations there were in each of the groups (rope types).
Five of the rope types had \(n = 20\) replications, so they each had \(20 - 1 = 19\) d.f. for error, because we estimate one parameter from the data for each rope type (namely, the treatment mean) in calculating its contribution to ss_within.
One of the rope types had \(n = 21\)
replications, so it had \(21-1=20\)
d.f. to contribute to the overall ss_within
.
Overall, therefore, the error has \(5 \times 19 + 1 \times 20 = 115\) d.f.
Lastly, there were six rope types, so there are \(6 - 1 = 5\) d.f. for rope.type (ss_among).
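In code, the among-group degrees of freedom is then simply (assuming n_groups from above):

```r
# Six rope types, minus one d.f. for the grand mean
df_among = n_groups - 1
```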
The mean squares are obtained simply by dividing each sum of squares by its respective degrees of freedom, as follows:
ms_among = ss_among / (n_groups - 1)
ms_within = ss_within / (n_obs - n_groups)
By tradition, we do not calculate the total mean square.
The test statistic is the F-ratio, defined as the among-group variance divided by the within-group variance.
Calculate the F-ratio and save it to a variable called f_ratio.
The F-ratio is used to test the null hypothesis that the treatment means are all the same.
If we reject this null hypothesis, we accept the alternative hypothesis that at least one of the means is significantly different from the others.
The question naturally arises at this point as to whether the observed F-ratio is a big number or not.
If it is a big number, then we reject the null hypothesis.
If it is not a big number, then we fail to reject the null hypothesis.
As ever, we decide whether the test statistic is big or small by comparing it to the values from an F probability distribution, given the number of degrees of freedom in the numerator and the number of degrees of freedom in the denominator.
Specifically, we want to know the Type 1 error rate (p-value); i.e., the probability of observing an F-ratio as large as ours given that the null hypothesis is true and thus the treatment means are the same.
To do this, we use pf() for cumulative probabilities of the F distribution. In our case the numerator was the among-group variance and the denominator was the within-group variance.
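A sketch of the last two steps, assuming the mean squares and degrees of freedom calculated above; lower.tail = FALSE gives the upper-tail probability we want:

```r
f_ratio = ms_among / ms_within

# P(F >= f_ratio) under the null hypothesis
f_pval = pf(f_ratio, df_among, df_within, lower.tail = FALSE)
```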
Look up the help entry for pf() for details on its arguments.

OK, now that you know how to perform a 1-way ANOVA manually, here is how you do it the easy way using anova() with a fitted model object.
how you do it the easy way using anova()
with a fitted
model object.
fit_1 = lm(p.cut ~ rope.type, data=rope)
anova(fit_1)
## Analysis of Variance Table
##
## Response: p.cut
## Df Sum Sq Mean Sq F value Pr(>F)
## rope.type 5 0.4729 0.094577 2.2312 0.05582 .
## Residuals 115 4.8747 0.042389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You will need to be able to extract the p-value and other relevant quantities from the ANOVA table.
You can use str()
to reveal the structure of the anova
table object and use the subset operator $ to extract the values of
interest:
anova_fit_1 = anova(fit_1)
str(anova_fit_1)
## Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
## $ Df : int 5 115
## $ Sum Sq : num 0.473 4.875
## $ Mean Sq: num 0.0946 0.0424
## $ F value: num 2.23 NA
## $ Pr(>F) : num 0.0558 NA
## - attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: p.cut"
You might notice that some of the named elements have names that include spaces or other characters that we normally use as operators or symbols in R code.
What happens when you try to extract the Sum Sq
element?
anova_fit_1$Sum Sq
## Error: <text>:1:17: unexpected symbol
## 1: anova_fit_1$Sum Sq
## ^
Well, that didn’t exactly work. What about using quotes?
anova_fit_1$"Sum Sq"
## [1] 0.4728867 4.8746836
Success!
Recall the ANOVA null hypothesis:
The observations within all groups were sampled from the same population.
In other words: “There are no differences in group means.”
Unlike with the t-test, a directional alternative hypothesis is not possible when there are 3 or more groups. Why? A typical formulation of the ANOVA alternative hypothesis is: “At least one of the group means is different from the rest.”
It’s natural to ask which group or groups are different. We can answer that question with a post-hoc test.
Tukey Honest Significant Difference (HSD) Test
A common post-hoc test is the Tukey HSD. It works by making pairwise comparisons between each possible pair of groups.
The number of pairwise comparisons grows quickly with the number of groups, so I’ll demonstrate with a model of a subset of the data:
rope2 = droplevels(
subset(
rope,
rope.type %in% c("PI", "VEL", "XTC"))
)
boxplot(
p.cut ~ rope.type,
data = rope2,
las = 2,
xlab = "",
ylab = "Proportion Rope Cut",
main = "Subset of Rope Data")
mtext("Rope Type", side = 1, line = 3)
The function for the Tukey HSD test is TukeyHSD()
.
It accepts an aov
object (not a lm
object).
Fortunately it is easy to convert a lm
to an
aov
using the aov()
function.
The output of the TukeyHSD()
function is a
TukeyHSD
object, which contains several components.
fit_rope_2 = lm(p.cut ~ rope.type, data=rope2)
rope2_hsd = TukeyHSD(aov(fit_rope_2))
class(rope2_hsd)
## [1] "TukeyHSD" "multicomp"
The most relevant part of the TukeyHSD
object is the
data.frame
of results:
round(rope2_hsd$rope.type, digits = 4)
## diff lwr upr p adj
## VEL-PI 0.1630 0.0194 0.3066 0.0224
## XTC-PI 0.0785 -0.0651 0.2221 0.3924
## XTC-VEL -0.0845 -0.2281 0.0591 0.3394
For each pair of groups, it tells us:
The p-values are adjusted for multiple testing and they represent the evidence against the null hypothesis that the group means are the same.
We can interpret the p-values for the pairs of rope types in the table as follows:

| | diff | lwr | upr | p adj |
|---|---|---|---|---|
| VEL-PI | 0.1630 | 0.0194 | 0.3066 | 0.0224 |
| XTC-PI | 0.0785 | -0.0651 | 0.2221 | 0.3924 |
| XTC-VEL | -0.0845 | -0.2281 | 0.0591 | 0.3394 |
Examine the boxplot and see if the results make intuitive sense.
- Check your calculated values against the anova() output.
- Keep the data file in a data subdirectory so that I can run your code on my machine.

rm(list = ls())
rope = ....
rope$rope.type = factor(rope$rope.type)
n_obs = ....
n_groups = ....
ss_tot = ....
df_tot = ....
agg_sum_sq_resids = ....
ss_within = ....
df_within = ....
ss_among = ....
df_among = ....
ms_within = ....
ms_among = ....
f_ratio = ....
f_pval = ....
We’ll use the following script to test your answers. You can use it as a self-test prior to submitting your answer.
The comparisons (using ==) will return values of TRUE if your calculations are correct.

# number comparison tolerance
digits_check = 5
digits_check = 5
# Build the reference model using R functions
fit_1 = lm(p.cut ~ rope.type, data=rope)
anova(fit_1)
anova_fit_1 = anova(fit_1)
# Check degrees of freedom
anova_fit_1$Df == c(df_among, df_within)
# Check sums of squares
round(anova_fit_1$`Sum Sq`, digits = digits_check) == round(c(ss_among, ss_within), digits = digits_check)
# Check mean squares
round(anova_fit_1$`Mean Sq`, digits = digits_check) == round(c(ms_among, ms_within), digits = digits_check)
# Check the F-ratio
round(anova_fit_1$`F value`[1], digits = digits_check) == round(f_ratio, digits = digits_check)
# Check the F test statistic p-value
round(anova_fit_1$`Pr(>F)`[1], digits = digits_check) == round(f_pval, digits = digits_check)
NOTE: If your code doesn’t run, we may return the assignment to you to fix the errors before we attempt to grade it.
To grade your submission we’ll run your code in an empty R
environment to load your calculated values into memory. Then we’ll run
the grading script above. You’ll get 1 point for each TRUE
from the grading script.
Let’s test whether our assumption of homogeneity (equal variance of the residuals within groups) is met for this analysis.
Review the material on testing for equivalence of variances from the previous lab.
For one-way ANOVA-type models, we can calculate the group means using the coefficients from the model table.
To prove to ourselves some of the connections between ANOVA,
categorical variables, dummy variables, and model coefficients, we’ll
calculate some group means using the model coefficient table and compare
them to our group mean calculations using aggregate()
from
the lab walkthrough.
First, fit a basic linear model and view the coefficient table:
fit_rope_1 = lm(p.cut ~ rope.type, data = rope)
summary(fit_rope_1)
##
## Call:
## lm(formula = p.cut ~ rope.type, data = rope)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2800 -0.1500 -0.0355 0.1030 0.6500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36714 0.04493 8.172 4.45e-13 ***
## rope.typeBS -0.13014 0.06433 -2.023 0.04538 *
## rope.typePI -0.18014 0.06433 -2.800 0.00599 **
## rope.typeSB -0.09514 0.06433 -1.479 0.14186
## rope.typeVEL -0.01714 0.06433 -0.266 0.79033
## rope.typeXTC -0.10164 0.06433 -1.580 0.11683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2059 on 115 degrees of freedom
## Multiple R-squared: 0.08843, Adjusted R-squared: 0.0488
## F-statistic: 2.231 on 5 and 115 DF, p-value: 0.05582
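For example (a sketch; fit_rope_1 is the model fit just above), the group means can be recovered because the intercept is the mean of the reference level (BLAZE), and each other coefficient is a difference from that reference mean:

```r
coefs = coef(fit_rope_1)

# Reference-level mean, then reference mean plus each difference
group_means = c(coefs[1], coefs[1] + coefs[-1])

# Compare to the group means from aggregate():
aggregate(x = rope$p.cut, by = list(rope$rope.type), FUN = mean)
```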
In addition to the constant variance assumption (was it met?), accurate calculation of p-values in ANOVA depends on normality of the residuals.
We’ll test for residual normality both within groups and for the overall model.
R has a handy function for obtaining the residuals from a
lm
object. Perhaps surprisingly, it’s called
residuals()
. The residuals()
function accepts
a model object, and returns a vector of the model residuals.
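For example, a Shapiro-Wilk test of the overall residuals might look like this (a sketch using fit_rope_1 from above; shapiro.test() is one common choice of normality test):

```r
# Retrieve the model residuals and test them for normality
shapiro.test(residuals(fit_rope_1))
```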
Testing the normality of the residuals within each group takes a bit
more coding. Fortunately, you can use your agg_resids
object. I suggest one of two approaches: a simple loop, or sapply() with a custom anonymous function (a more elegant solution).

Things to consider:

- Use str() to examine the structure of your agg_resids object. Which component contains the vectors of residuals?
- sapply() accepts a list object and applies a function to each element in the list.

Finally, use the residuals() function to retrieve the residuals from your model and perform an overall normality test. Report the p-value.

In the last set of questions, we'll use a different dataset: our good friend the Palmer penguins!

For these questions, we'll be working with a simplified version of the data (only the female penguins).
require(palmerpenguins)
pen_fem = subset(penguins, sex == "female")