Individual Assignment - Water Vole Regression
Intro to Quantitative Ecology
You are studying factors that may influence water vole (Arvicola amphibius) populations.
You’ve learned how to create and interpret simple and multiple linear regressions in R. You’ll use your modeling skills to create an analysis of water vole populations in an RMarkdown document.
You have a data set that includes vole counts at census sites, as well as the distance to nearest road and the percent vegetation in the census patch.
You will conduct an analysis in R to investigate whether the number of voles captured (Voles) is influenced by:
PercVeg: the percent vegetation in the census patch
Dist2Road: the distance to the nearest road
The data file is called vole_trapping.csv; you can find it in the assignment data section of the course GitHub page.
As with any data set, you need to take a look at its contents and some of its basic properties with head() and summary().
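For example, a minimal sketch of loading and inspecting the data might look like the code below. The object name dat_voles is a placeholder of my own, and the code assumes vole_trapping.csv is in your working directory; adjust both to match your setup.
# Read the vole trapping data (assumes the csv is in your working directory)
dat_voles <- read.csv("vole_trapping.csv")

# Peek at the first few rows and basic summaries of each column
head(dat_voles)
summary(dat_voles)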
For each of the variables (response, predictors) in the data set, you need to plot histograms to assess the distributions.
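A sketch of the histograms, assuming the data frame is called dat_voles and the columns are named Voles, PercVeg, and Dist2Road as described above:
# 2 x 2 panel layout for the histograms
par(mfrow = c(2, 2))

hist(dat_voles$Voles,     main = "Vole Counts",        xlab = "Voles captured")
hist(dat_voles$PercVeg,   main = "Percent Vegetation", xlab = "Vegetation (%)")
hist(dat_voles$Dist2Road, main = "Distance to Road",   xlab = "Distance to nearest road")

# Reset the plotting layout when finished
par(mfrow = c(1, 1))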
Next, you need to make scatterplots of the response and predictor variables.
In your predictor/response scatterplots, you need to assess whether there appears to be an association between each predictor and the response.
Based on your observations of the predictor/response scatterplots, formulate a null and alternative hypothesis for the association between each predictor and the response.
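For example, again assuming the data frame is called dat_voles, scatterplots of the response against each predictor can be built with plot() and the formula interface:
# Vole counts vs. each predictor, side by side
par(mfrow = c(1, 2))
plot(Voles ~ PercVeg,   data = dat_voles, main = "Voles vs. Percent Vegetation")
plot(Voles ~ Dist2Road, data = dat_voles, main = "Voles vs. Distance to Road")
par(mfrow = c(1, 1))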
Finally, you need to check whether your two candidate predictor variables are correlated using a pairplot. R has a built-in pairplot function called pairs(), but you should use the pairs.panels() function from the package psych. You'll have to install psych first to use it.
Here's some example code using R's built-in pairs() function and the pairs.panels() function from psych on the penguins dataset.
pairs()
The pairs() function will accept an entire data frame and attempt to plot each column:
require(palmerpenguins)
pairs(penguins)
This plot contains too much information, so I need to specify only a few columns to include.
What are the columns in penguins?
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm
## 1 Adelie Torgersen 39.1 18.7 181
## 2 Adelie Torgersen 39.5 17.4 186
## 3 Adelie Torgersen 40.3 18.0 195
## 4 Adelie Torgersen NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193
## 6 Adelie Torgersen 39.3 20.6 190
## body_mass_g sex year
## 1 3750 male 2007
## 2 3800 female 2007
## 3 3250 female 2007
## 4 NA <NA> 2007
## 5 3450 female 2007
## 6 3650 male 2007
I want to include the four numeric columns, so I can use the square brackets along with a character vector including their names:
head(
penguins[, c(
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18.0 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
pairs(
penguins[, c(
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])
pairs.panels()
This function produces a pair plot with nicer formatting than base R's pairs(). It also gives me the values of the correlation coefficients in the cells above the diagonal.
# I need to load the psych package (which I installed earlier).
require(psych)
# pairs.panels() uses the same syntax as pairs
pairs.panels(
penguins[, c(
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])
To remind yourself how to do this, you should create an R code chunk. Call the chunk "Install psych Package" and make sure it includes the code needed to install the package called psych. Set the chunk option eval = FALSE so that the code in the chunk doesn't run when you knit your document. Remember, you only need to install a package once!
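As a reminder, the installation code inside that chunk is a single call:
# This only needs to be run once per machine; hence eval = FALSE in the chunk
install.packages("psych")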
If a chunk runs install.packages() when you knit a document, you'll receive a cryptic error. 🤔

You'll be using simple and multiple linear regression techniques to study the two predictors.
For each model you create (descriptions below), you should use the following workflow: fit the model with lm(), check the normality of the model residuals with the residuals() function, and examine the model coefficient table with summary().
For each predictor variable (road distance and vegetation %), you need to fit an SLR using the vole count as the response. Call your model objects fit_veg and fit_road.
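A sketch of the two fits, assuming your data frame is called dat_voles (substitute whatever name you used when you read in vole_trapping.csv):
# Simple linear regressions: vole count as the response
fit_veg  <- lm(Voles ~ PercVeg,   data = dat_voles)
fit_road <- lm(Voles ~ Dist2Road, data = dat_voles)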
Examine the model coefficient table using summary() on your fitted model objects. Specifically, you need to examine the slope coefficient estimates, their p-values, and the model r-squared values.
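For example, the coefficient tables (estimates, standard errors, t-values, and p-values) and r-squared values are printed by:
# Coefficient tables and r-squared values for each simple regression
summary(fit_veg)
summary(fit_road)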
Now, fit an additive multiple linear regression model using lm(). Remember that an additive model does not contain an interaction term. The R syntax for the formula of an additive multiple regression uses the plus symbol to show that the effects are additive. You can find example syntax in the week 10 slides, deck 1.
Call your model object fit_additive.
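For example, assuming the data frame is again called dat_voles, the additive formula places a + between the two predictors:
# Additive model: both predictors, no interaction term
fit_additive <- lm(Voles ~ PercVeg + Dist2Road, data = dat_voles)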
Examine the additive model's coefficient table. Pay close attention to:
Each predictor’s slope coefficient.
Is it similar to the value from the corresponding Simple Linear Regression?
Did the p-value change?
The model’s r-squared values
Finally, fit an interactive multiple regression model. The R syntax for the formula of an interactive multiple regression uses the multiplication (star) symbol to show that the effects are interacting. Call your model object fit_interactive.
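Assuming the same data frame name, the interactive formula uses * so that R includes both main effects and their interaction:
# Interactive model: main effects plus the PercVeg:Dist2Road interaction
fit_interactive <- lm(Voles ~ PercVeg * Dist2Road, data = dat_voles)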
Recall that a key assumption in our models is that our model residuals are normally distributed. I can build a histogram of the residuals of a model (fit_veg) using the following code:
hist(residuals(fit_veg))
I can also use the plot() function to view residual vs. predicted value scatterplots and Quantile-Quantile plots:
# To create two plots in the same figure:
par(mfrow = c(1, 2))
plot(fit_veg, which = 1:2)
Finally, I can conduct a formal statistical test of normality. What is the null hypothesis here?
shapiro.test(residuals(fit_veg))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit_veg)
## W = 0.98984, p-value = 0.9925
Use the code templates above to perform numerical and graphical tests for normality of your model residuals.
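For example, the Shapiro-Wilk test can be applied to each fitted model's residuals in turn (the four object names below assume you followed the naming above):
# Shapiro-Wilk normality tests for each model's residuals
shapiro.test(residuals(fit_veg))
shapiro.test(residuals(fit_road))
shapiro.test(residuals(fit_additive))
shapiro.test(residuals(fit_interactive))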
Recall that we can use the Akaike Information Criterion (AIC) as an objective procedure to compare several models.
AIC attempts to balance the model complexity (number of terms) and model fit (correlation, r-squared value) in order to find a ‘best’ model or models.
The idea is that adding additional terms to a model always increases the r-squared value, even if the additional predictor variables don’t have significant effects (i.e. the p-values associated with their coefficients are non-significant).
Simpler models may have lower r-squared values, but they are much easier to interpret and explain to others.
The AIC procedure rewards good model fit (via sums of squares) and penalizes complicated models (the number of model terms).
When you calculate AIC values for a set of models, the model(s) with the lowest AIC are said to be the best choice(s). Sometimes a group of models have very similar AIC values.
Use the AIC() function to calculate the AIC value for each of the four models you created.
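AIC() accepts several fitted model objects at once and returns a small table of AIC values, for example:
# Compare the four candidate models; lower AIC indicates a better balance of fit and complexity
AIC(fit_veg, fit_road, fit_additive, fit_interactive)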
You'll do your work in an RMarkdown document. Your document needs to include the plots and analyses described above. For the histograms, use par(mfrow = c(2, 2)) to set up the 4-panel plot. Each histogram must have an appropriate title and axis labels.