Recall the simple 1- and 2-sample t-tests.
T-tests are univariate tests that we can use to determine whether we have good evidence that:
In R, the t-test syntax is simple.
Try running a 1-sample t-test on the Gentoo penguin flipper lengths:
t.test(subset(penguins, species == "Gentoo")$flipper_length_mm)
##
## One Sample t-test
##
## data: subset(penguins, species == "Gentoo")$flipper_length_mm
## t = 371.43, df = 122, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 216.0295 218.3445
## sample estimates:
## mean of x
## 217.187
Instead of comparing Gentoo penguin flipper lengths to zero, let’s test whether they are equal to 218 mm:
t.test(
x = subset(penguins, species == "Gentoo")$flipper_length_mm,
mu = 218
)
##
## One Sample t-test
##
## data: subset(penguins, species == "Gentoo")$flipper_length_mm
## t = -1.3904, df = 122, p-value = 0.1669
## alternative hypothesis: true mean is not equal to 218
## 95 percent confidence interval:
## 216.0295 218.3445
## sample estimates:
## mean of x
## 217.187
Let’s try a one-tailed alternative hypothesis: Gentoo penguin flippers are smaller than 218 mm:
t.test(
x = subset(penguins, species == "Gentoo")$flipper_length_mm,
mu = 218,
alternative = "less"
)
##
## One Sample t-test
##
## data: subset(penguins, species == "Gentoo")$flipper_length_mm
## t = -1.3904, df = 122, p-value = 0.08347
## alternative hypothesis: true mean is less than 218
## 95 percent confidence interval:
## -Inf 218.1561
## sample estimates:
## mean of x
## 217.187
Instead of comparing the flipper length of Gentoo penguins to a fixed value, I could compare the flipper lengths of two species:
t.test(flipper_length_mm ~ species, data = subset(penguins, species != "Chinstrap"))
##
## Welch Two Sample t-test
##
## data: flipper_length_mm by species
## t = -34.445, df = 261.75, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0
## 95 percent confidence interval:
## -28.79018 -25.67652
## sample estimates:
## mean in group Adelie mean in group Gentoo
## 189.9536 217.1870
Referring back to the directional alternative hypothesis with a 1-sample t-test above, how could I modify the code to test the alternative hypothesis that Adelie penguins have shorter than Gentoo penguins?
To gain some experience with 1-way Analysis of Variance, you’ll be working with the penguin data.
Response variable: body mass
continuous variable
ratio scale
Predictor variable: species
categorical variable
nominal scale
To perform an ANOVA in R, you can use this procedure:
lm()
summary()
anova()
I’ll walk through an example using penguin body mass and species.
You can review the slide deck for Interpreting model coefficient and ANOVA tables for further examples and a guide to interpreting model coefficients.
We can explore normality using histograms and density plots:
par(mfrow = c(1, 2))
hist(penguins$body_mass_g, breaks = 80, main = "histogram of body mass", xlab = "body mass (g)")
plot(density(penguins$body_mass_g, na.rm = TRUE), main = "density plot of body mass")
What do you notice?
Conditional boxplots are great for categorical variables:
require(palmerpenguins)
boxplot(body_mass_g ~ species, data = penguins)
What do you notice?
Remember the assumption of normality? Let’s test whether within-group data are normally-distributed. We need to do some data massaging:
dat_chinstrap = subset(penguins, species == "Chinstrap")
mean(dat_chinstrap$body_mass_g, na.rm = TRUE)
## [1] 3733.088
shapiro.test(dat_chinstrap$body_mass_g)
##
## Shapiro-Wilk normality test
##
## data: dat_chinstrap$body_mass_g
## W = 0.98449, p-value = 0.5605
Shapiro test null hypothesis: “The data are drawn from a normally-distributed population.”
If we fail to reject the null, can we consider the data sufficiently normal to proceed?
Here’s a cool shortcut for calculating the species mean body masses
using aggregate()
and the formula notation:
aggregate(body_mass_g ~ species, data = penguins, FUN = mean)
## species body_mass_g
## 1 Adelie 3700.662
## 2 Chinstrap 3733.088
## 3 Gentoo 5076.016
shapiro.test()
! (click to
show/hide)
aggregate(
body_mass_g ~ species,
data = penguins,
FUN = function(x) shapiro.test(x)$p.value)
## species body_mass_g
## 1 Adelie 0.03239702
## 2 Chinstrap 0.56050824
## 3 Gentoo 0.23361649
The syntax is easy if you use the formula notation:
fit_species = lm(body_mass_g ~ species, data = penguins)
Then we can look at the model coefficients:
summary(fit_species)
##
## Call:
## lm(formula = body_mass_g ~ species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1126.02 -333.09 -33.09 316.91 1223.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3700.66 37.62 98.37 <2e-16 ***
## speciesChinstrap 32.43 67.51 0.48 0.631
## speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.3 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
## F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
R makes it easy to conduct an ANOVA:
anova(fit_species)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 146864214 73432107 343.63 < 2.2e-16 ***
## Residuals 339 72443483 213698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A simple Analysis of Variance belongs to the Group 1 analyses. It is a linear model!
The syntax to build the model in R is easy. Use the lm() function to store the model in a variable:
fit_species = lm(body_mass_g ~ species, data = penguins)
Note the use of the R formula notation.
Then we can look at the model coefficients:
summary(fit_species)
##
## Call:
## lm(formula = body_mass_g ~ species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1126.02 -333.09 -33.09 316.91 1223.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3700.66 37.62 98.37 <2e-16 ***
## speciesChinstrap 32.43 67.51 0.48 0.631
## speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.3 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
## F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
R makes it easy to conduct an ANOVA:
anova(fit_species)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 146864214 73432107 343.63 < 2.2e-16 ***
## Residuals 339 72443483 213698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now remember the two model summary tables we have discussed: Model coefficient tables and ANOVA tables.
Remember the interpretation of linear model coefficients for a categorical variable:
For the penguin body mass and species model, what is the base case?
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 3700.662 | 37.619 | 98.371 | 0.000 |
speciesChinstrap | 32.426 | 67.512 | 0.480 | 0.631 |
speciesGentoo | 1375.354 | 56.148 | 24.495 | 0.000 |
We’ve interpreted the regression coefficients for our model of penguin body masses as predicted by species.
Now we’ll take a look at the other common way model results are displayed, the ANOVA table.
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
species | 2 | 146864214 | 73432107.1 | 343.6263 | 0 |
Residuals | 339 | 72443483 | 213697.6 |
The sum of squares columns tell us about how much of the total data variability is explained by each of the predictor variables.
In our case we have only 1, the species.
The residuals sum of squares contains information about the variation that our model couldn’t explain.
R doesn’t print a line with total sum of squares, but it is equal to the sum of all the numbers in the Sum sq. column.
The total sum of squares is a measure of the total variability in the data.
The Sum Sq column tells us about the variability explained by each factor, but the Mean sq column allows us to compare the relative amount of information that each factor explains.
Differences between ANOVA and model coefficient tables:
To summarize the interpretation of this model’s ANOVA table:
Did you notice that the conditional boxplot of body mass and penguin species suggested that the distributions of body mass might not be the same in each species? For reference, take a look at the shape of the boxes for Chinstrap and Gentoo penguins:
boxplot(body_mass_g ~ species, data = penguins)
You might have had a hunch that perhaps penguin sex is also related to body mass. We can plot another conditional boxplot, this time conditioned on two categorical variables: species and sex:
After you look at this plot, do you think sex could be important?
We’ll first fit a model with two predictors:
In this first 2-way model, we won’t include interaction terms.
The syntax to fit a model with two predictors, without interaction terms, is:
fit_additive = lm(body_mass_g ~ sex + species, data = penguins)
Some things to note:
Here’s our last ANOVA-style model.
The syntax to build the model is only slightly different than that of the additive model: You just have to replace the plus symbol with a multiplication symbol to indicate an interaction (or crossing) between the two factor predictors (sex and species).
fit_interactive = lm(body_mass_g ~ sex .....)
You need to fill in the rest of the code yourself to complete the assignment.
When you have the model specified correctly, your model coefficient table should look like:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 3368.836 | 36.212 | 93.030 | 0.000 |
sexmale | 674.658 | 51.212 | 13.174 | 0.000 |
speciesChinstrap | 158.370 | 64.240 | 2.465 | 0.014 |
speciesGentoo | 1310.906 | 54.422 | 24.088 | 0.000 |
sexmale:speciesChinstrap | -262.893 | 90.849 | -2.894 | 0.004 |
sexmale:speciesGentoo | 130.437 | 76.436 | 1.706 | 0.089 |
require(palmerpenguins)
lm(bill_length_mm ~ body_mass_g, data = penguins)
##
## Call:
## lm(formula = bill_length_mm ~ body_mass_g, data = penguins)
##
## Coefficients:
## (Intercept) body_mass_g
## 26.898872 0.004051
Q1 (4 pts.): Re-create the conditional boxplot of penguin body mass conditioned on sex and species. Include your boxplot as a figure in your report.
names
argument.\n
character to split your
species/sex labels across two lines.Q2 (2 pts.): Based on the boxplots, do you think male penguins are significantly heavier than female penguins of the same species? Explain your reasoning, and be sure to explain why you think any observed differences are significant or not.
Q3 (2 pts.): Do you think adding sex to a model that already includes species will improve the model fit?
lm()
to a variable called
fit_both
.Warning: in this case we need to fit a 2-way ANOVA model with interactions. You have to use the star symbol instead of the plus symbol to make sure that R includes all of the factor combinations.
When you fit a model with two categorical predictors, the base case is now a combination of the base level of predictor one and the base level of predictor two.
summary(fit_both)
to examine the model coefficient
table for a clue.fit_both
.
Recall that you can use the model coefficients to calculate group means when you have a model with categorical predictors.
summary(fit_both)
to examine the values of the
coefficients, and their p-values, in the model coefficient table.Q6 (2 pts.): What are the names of the two coefficients (from the first column of the coefficient table) that you need to calculate the average mass of female Chinstrap penguins?
Q7 (2 pts.): What is the predicted average mass of female Chinstrap penguins in the interactive model?
Q8 (2 pts.): What is the observed average mass of female Chinstrap penguins, calculated from the penguins data?
aggregate()
function for an elegant approach that does not
use logical subsetting.