Overview

You are studying factors that may influence water vole (Arvicola amphibius) populations.

You’ve learned how to create and interpret simple and multiple linear regressions in R. You’ll use your modeling skills to create an analysis of water vole populations in an RMarkdown document.

Notes on the assignment:

This is an individual assignment, but I encourage you to work with your peers.
Even though you will work with your peers, every answer must be in your own words.

Vole Census Data

You have a data set that includes vole counts at census sites, as well as the distance to nearest road and the percent vegetation in the census patch.

You will conduct an analysis in R to investigate whether the number of voles captured (Voles) is influenced by:

the % vegetation in each habitat patch, PercVeg
the distance to the nearest road , Dist2Road

The data file is called vole_trapping.csv, you can find it in the assignment data section of the course GitHub page.

Graphical and Numerical Exploration

As with any data set, you need to take a look at its contents and some of their basic properties with head() and summary()

For each of the variables (response, predictors) in the data set, you need to plot histograms to assess the distributions.

Next, you need to make scatterplots of the response and predictor variables.

In your predictor/response scatterplots, you need to assess:

Is there a clear association in the plot?
If so, does the trend appear to be linear? Monotonic?
Is the relationship positive, negative, or none?

Based on your observations of the predictor/response scatterplots, formulate a null and alternative hypothesis for the association between the predictor and response.

Finally, you need to check whether your two candidate predictor variables are correlated using a pairplot. R has a built-in pairplot function called pairs(), but you should use the pairs.panels() function from package psych. You’ll have to install psych first to use it.

Here’s some example code using R’s built-in pairs() function and the pairs.panels() from psych on the penguins dataset.

`pairs()`

The pairs() function will accept an entire data frame and attempt to plot each column:

require(palmerpenguins)
pairs(penguins)

This plot contains too much information, so I need to specify only a few columns to include.

What are the columns in penguins?

head(penguins)

##   species    island bill_length_mm bill_depth_mm flipper_length_mm
## 1  Adelie Torgersen           39.1          18.7               181
## 2  Adelie Torgersen           39.5          17.4               186
## 3  Adelie Torgersen           40.3          18.0               195
## 4  Adelie Torgersen             NA            NA                NA
## 5  Adelie Torgersen           36.7          19.3               193
## 6  Adelie Torgersen           39.3          20.6               190
##   body_mass_g    sex year
## 1        3750   male 2007
## 2        3800 female 2007
## 3        3250 female 2007
## 4          NA   <NA> 2007
## 5        3450 female 2007
## 6        3650   male 2007

I want to include the four numeric columns, so I can use the square brackets along with a character vector including their names:

head(
  penguins[, c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g")])

##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18.0               195        3250
## 4             NA            NA                NA          NA
## 5           36.7          19.3               193        3450
## 6           39.3          20.6               190        3650

pairs(
  penguins[, c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g")])

`pairs.panels()`

This function produces a pair plot with nicer formatting than base R’s pairs().

It also gives me the values of the correlation coefficients in the cells above the diagonal.

# I need to load the psych package (which I installed earliner).
require(psych)


# pairs.panels() uses the same syntax as pairs
pairs.panels(
  penguins[, c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g")])

Installing Packages

To remind yourself how to do this, you should create an R code chunk. Call the chunk “Install psych Package” and make sure it includes the code needed to install the package called psych.

This chunk will likely be a helpful reminder for you in a later assignment.
You should set the chunk option eval = FALSE so that the code in the chunk doesn’t run when you knit your document. Remember, you only need to install a package once!
- In fact, if you try to run install.packages() when you knit a document, you’ll receive a cryptic error 🤔

Linear Regression Models

You’ll be using simple and multiple linear regression techniques to study the two predictors.

Workflow for Code Chunks

For each model you create (descriptions below), you should use the workflow:

Create a second-level heading (two # symbols) with a very short descriptive title.
- For example, I might call one of my sections “Additive Model: Vegetation and Road Distance”
Create a code chunk in which you:
- Build a model object using lm()
Create a code chunk in which you perform some simple model diagnostics. You can find template code below.
- Examine a histogram of the model’s residuals.
- Note that you can fetch the model’s residuals using the residuals() function.
- Perform a statistical test to determine whether the model residuals are Normally distributed.
Create a code chunk in which you:
- Examine the model coefficient table produced by summary()
- Make sure you understand what the coefficient values mean: be able to translate them into English.

Simple Linear Regressions

For each predictor variable (road distance and vegetation %), you need to fit an SLR using the vole count as the response.

Call your model objects fit_veg and fit_road.

Examine the model coefficient table using summary() on your fitted model objects. Specifically you need to examine:

The predictor’s coefficient value: the slope.
Is it positive or negative (or zero)?
What is the significance level of the coefficient?
How would you translate the coefficient value into an English statement?
The model’s r-squared values

Muliple Linear Regression: Additive Model

Now, fit an additive multiple linear regression model using lm(). Remember that an additive model does not contain an interaction term.

The R syntax for the formula for an additive multiple regression uses the plus symbol to show that the effects are additive. You can find example syntax in the week 10 slides, deck 1.

Save your model to a variable called fit_additive.

Examine the additive model’s coefficient table. Pay close attention to:

Each predictor’s slope coefficient.
Is it similar to the value from the corresponding Simple Linear Regression?
Did the p-value change?
The model’s r-squared values

Muliple Linear Regression: Interactive Model

Finally, fit an interactive multiple regression model.

The R syntax for the formula of an interactive multiple regression uses the multiplication (star) symbol to show that the effects are interacting.

Save your model to a variable called fit_interactive.

Model Diagnostics

Recall that a key assumption in our models is that our model residuals are normally distributed. I can build a histogram of the residuals of a model (fit_veg) using the following code:

hist(residuals(fit_veg))

I can also use the plot() function to view residual vs. predicted value scatterplots and Quantile-Quantile plots:

# To create two plots in the same figure:
par(mfrow = c(1, 2))
plot(fit_veg, which = 1:2)

Finally, I can conduct a formal statistical test of normality. What is the null hypothesis here?

shapiro.test(residuals(fit_veg))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_veg)
## W = 0.98984, p-value = 0.9925

Use the code templates above to perform numerical and graphical tests for normality of your model residuals.

Model Selection

Recall that we can use the Akaike Information Criterion (AIC) as an objective procedure to compare several models.

AIC attempts to balance the model complexity (number of terms) and model fit (correlation, r-squared value) in order to find a ‘best’ model or models.

The idea is that adding additional terms to a model always increases the r-squared value, even if the additional predictor variables don’t have significant effects (i.e. the p-values associated with their coefficients are non-significant).
Simpler models may have lower r-squared values, but they are much easier to interpret and explain to others.

The AIC procedure rewards good model fit (via sums of squares) and penalizes complicated models (the number of model terms).

When you calculate AIC values for a set of models, the model(s) with the lowest AIC are said to be the best choice(s). Sometimes a group of models have very similar AIC values.

As a rule of thumb, differences in AIC values of more than about 5% of the absolute value of the models’ AIC are considered good evidence for a ‘best model’.
If AIC values are very similar, you may need to use your expert judgment to choose a best model (perhaps based on how easy it is to interpret and explain to others).

Use the AIC() function to calculate the AIC value for each of the four models you created.

Report

You’ll do your work in an RMarkdown document.

Your document needs to include:

A section and code chunk that reads in the data file.
Section for graphical and numerical exploration
Regression model section with second-level sections for each of the 4 models
A Model Selection section
A Report section in which you answer the following questions:

Report Questions

Q1 (1 pt.): What is the median distance to the nearest road?
Q2 (1 pt.): What is the Pearson correlation coefficient between the two predictor variables? Round your answer to 2 decimal places.
Q3 (2 pts.): Do you think the correlation between the two predictors is strong enough to cause problems in a regression model? Cite two pieces of graphical and/or numerical evidence.
Q4 (2 pts.): Review the model coefficient table for your simple linear regression model of vole population and vegetation. What were the intercept and slope parameters? Round your answers to 3 decimal places
Q5 (3 pts.): Interpret, in plain English, the ecological meaning of the value of the model’s intercept.
Q6 (3 pts.): Interpret, in plain English, the ecological meaning of the value of the model’s slope coefficient.
Q7 (1 pt.): Review the model coefficient table for your additive multiple regression model. What was the value of the intercept?
Q8 (3 pts.): Interpret, in plain English, the ecological meaning of the value of the additive model’s intercept. Make sure your answer includes a description of both predictors.
Q9 (3 pts.): Interpret, in plain English, the ecological meaning of the value of the additive model’s percent vegetation slope coefficient. Note: Your answer must include information about both predictors.
Q10 (3 pts.): Create a code chunk that makes a plot that contains 4 panels that show histograms of the residuals of each of your models. Hint: use the code par(mfrow = c(2, 2)) to set up the 4-panel plot. Each histogram must have an appropriate titles and axis labels.
Q11 (2 pts.): For each model, state whether you think it meets the assumption of normality of the residuals. Use the information from your numerical and graphical analyses of the residuals.
Q12 (2 pts.): What were the AIC values of each of the 4 models? Round your answers to 2 decimal places.
Q13 (2 pts.): In terms of the AIC values, was there a clear worst model? If so which one was worse and why?
Q14 (3 pts.): Considering the AIC values of the four models and the model complexity, which model would you choose?
- Your answer must include the model you chose and your reasoning.

Individual Assignment - Water Vole Regression Intro to Quantitative Ecology: