Lab Concepts and Objectives

This lab covers three techniques that we’ve mentioned briefly in the lecture: additive models, nonlinear least squares, and logistic regression.

In order to complete this lab, you’ll need some of the material you’ve used to answer questions in previous labs, most notably labs 3, 5, and 11.

Additive Models: LOWESS

Recall our power analysis of sample size for the Brown Creeper and Late-Successional Forest cover from lab 11.

For this section of the walkthrough, I’ll be using data from a sample size simulation. You can review the “Simulating Sample Sizes” section of lab 11 for the simulation code. My simulation used the following characteristics:

  • Sample size varied between 2 and 20.
  • I used the observed values from the model:
    • Slope (effect size)
    • Intercept
    • Model standard deviation
  • Alpha level of 0.05
  • 30 simulations at each sample size

To follow along with my code, you’ll need to conduct a similar simulation and save the simulation output to a data.frame called “sim_sample_size”.
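
My code assumes the simulation output has columns named sample_size and power; a quick structure check (a sketch; your column names may differ if you modified the lab 11 code):

# Verify the columns used in the model-fitting code below:
head(sim_sample_size)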

Here’s a plot of a realization of the simulation with 30 replicates at each sample size:

Locally Weighted Scatterplot Smoothing

The plot looks pretty jagged; we can use a smoothing procedure to make the plot less rough. Take a look at an example before we dive into details:

Locally Weighted Scatterplot Smoothing (LOWESS, sometimes called LOESS) is a commonly-used smoothing procedure.

LOWESS works by fitting a ‘local’ weighted regression at each value of the predictor.
Observations whose predictor values are near the focal value are given more ‘weight’ in each local fit.

The proportion of data points used in the locally weighted regressions affects the shape of the curve. Using a greater proportion of the data makes the curve smoother - it “smooths out” local detail. Using a smaller portion of the data for each local regression preserves local details.

Take a look at LOESS applied to the same data with different amounts of smoothing:
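
If you’d like to recreate a comparison like this yourself, here’s a minimal sketch (assuming the sim_sample_size data.frame described above):

# Fit the same smoother with a large and a small span:
fit_span_90 = loess(power ~ sample_size, data = sim_sample_size, span = 0.9)
fit_span_30 = loess(power ~ sample_size, data = sim_sample_size, span = 0.3)

# Predict over a fine grid of sample sizes so the curves plot smoothly:
newdata = data.frame(sample_size = seq(2, 20, length.out = 100))
plot(
  newdata$sample_size, predict(fit_span_90, newdata = newdata),
  type = "l",
  ylab = "Statistical Power", xlab = "Sample Size")
lines(
  newdata$sample_size, predict(fit_span_30, newdata = newdata),
  lty = 2)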

Fitting a LOWESS Model

The syntax for fitting a LOWESS model is very similar to what we’ve seen with lm().

Here I’ve fit a model using 50% of the data (the span argument).

fit_lowess_50 = loess(power ~ sample_size, data = sim_sample_size, span = 0.5)

First I’ll create a data.frame of new predictor values; I’ll then use it with predict() to plot my smoother:

newdata_sample_size = data.frame(sample_size = seq(2, 20, length.out = 100)) 

Finally, I can plot it:

plot(
  x = newdata_sample_size$sample_size,
  y = predict(fit_lowess_50, newdata = newdata_sample_size),
  type = "l",
  ylab = "Statistical Power", xlab = "Sample Size")

If I wanted to be fancy, I could plot the non-smoothed simulation results in a different color and include a legend:

You now know enough R to make the fancy version of the plot yourself!
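
In case you get stuck, here’s one way to build the fancy plot (a sketch, assuming the sim_sample_size, fit_lowess_50, and newdata_sample_size objects from above):

# Simulation results as translucent points, smoother as a red line:
plot(
  power ~ sample_size, data = sim_sample_size,
  pch = 16, col = gray(0, 0.3),
  ylab = "Statistical Power", xlab = "Sample Size",
  main = "Sample Size Simulation with LOWESS Smoothing")
lines(
  newdata_sample_size$sample_size,
  predict(fit_lowess_50, newdata = newdata_sample_size),
  col = "red", lwd = 2)
legend(
  "bottomright",
  legend = c("simulated power", "LOWESS: 50%"),
  pch = c(16, NA), lty = c(NA, 1),
  col = c(gray(0, 0.3), "red"))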

Nonlinear Least Squares

Do you remember the Ricker function?

And the Ricker function models we fit by hand in lab 5?

We can use Nonlinear Least Squares to fit an arbitrary function to data using a least squares criterion. I’ll present an example.

Background on the example

I performed the following steps prior to running the example code shown below. You should recreate them before following along. I’ll leave it up to you to review materials from lab 5 if you need a refresher on how to do these steps.

  • Note that in the example code, I use a data.frame called dat_dispersal.
  • I read the dispersal.csv file from lab 5 using read.csv() and here().
  • I’ve run the code to store ricker_fun() into my global environment (a minimal version is sketched just below), and I used my guesses for the two parameters along with curve() to plot my guess fit.
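
In case you don’t have it handy, here’s a minimal sketch of the function (assuming the parameterization we used in lab 5):

# The Ricker function: f(x) = a * x * exp(-b * x)
ricker_fun = function(x, a, b)
{
  return(a * x * exp(-b * x))
}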

Fitting a NLS model in R

We’ll be using the nls() function to fit our model.

Recall that nonlinear least squares attempts to find the parameter values of a function that minimize the sum of squared errors (i.e. the sum of squared residuals). Unlike linear regression, the function doesn’t have to be linear in the parameters.
The Ricker function is definitely not linear in the parameters!
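
To make the least squares criterion concrete, here’s a sketch of the quantity that nls() will minimize for our model (ricker_sse is a hypothetical helper for illustration only; nls() computes this internally):

# Sum of squared residuals for a candidate (a, b) pair; nls() searches
# for the pair that minimizes this quantity. na.rm skips missing
# observations. (ricker_sse is just for illustration.)
ricker_sse = function(a, b)
{
  resids = dat_dispersal$disp.rate.ftb -
    ricker_fun(dat_dispersal$dist.class, a, b)
  return(sum(resids^2, na.rm = TRUE))
}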

Here’s the syntax to fit a Ricker model:

fit_ricker_nls = nls(
  disp.rate.ftb ~ ricker_fun(dist.class, a, b),
  data = dat_dispersal,
  start = list(b = 0, a = 1))
summary(fit_ricker_nls)
## 
## Formula: disp.rate.ftb ~ ricker_fun(dist.class, a, b)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)   
## b 0.003586   0.000913   3.928  0.00201 **
## a 0.004471   0.001889   2.367  0.03558 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2149 on 12 degrees of freedom
## 
## Number of iterations to convergence: 17 
## Achieved convergence tolerance: 9.07e-06
##   (1 observation deleted due to missingness)

The syntax shares some similarities with that of lm(), but notice that the model formula uses our ricker_fun() function. Notice also that there is a start argument. We use start to give nls() the parameter values from which to begin its search for the optimal values.

In the interest of not re-inventing wheels, I’ll refer you to a good blog post on using nls() here. It has a clear explanation of the syntax, as well as a nice example.

NLS fitting uses iterative methods to find the optimal parameter values. The algorithm uses the current parameter values to make guesses for the next iteration. When subsequent iterations do not result in changed parameter values (within a tolerance), the algorithm terminates.

The output of the model summary tells us how many iterations were needed to find the optimal parameter values.
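
If you’d like to watch the search happen, you can re-fit the model with trace = TRUE, which prints the residual sum of squares and the current parameter values at each iteration:

# Same model as above; trace = TRUE just prints the iteration progress:
nls(
  disp.rate.ftb ~ ricker_fun(dist.class, a, b),
  data = dat_dispersal,
  start = list(b = 0, a = 1),
  trace = TRUE)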

NLS can be very sensitive to your initial guesses for parameter values.

Counterintuitively, if your initial guesses are very close to the true optima, the algorithm can fail and return an error. Note that the initial values I provided (0 and 1) are very different from the true optima.

Plotting an NLS model

Start by plotting your data points. I’ll let you create your own scatterplot; it should look something like this:

We’ll use the predict() function with our fitted nls model. The predict() function requires a data.frame that contains values of the predictor variables. It is a little bit picky, though: the columns of the data.frame must have exactly the same names as the predictor variables used to build the model.

The first step is to make such a data frame:

dist_newdata = data.frame(dist.class = seq(0, 1600, length.out = 1600))  

Now we can plot the results of our model!

The lines() function will add to an existing plot:

# Supply the x-values explicitly so the curve lines up with dist.class:
lines(
  dist_newdata$dist.class,
  predict(fit_ricker_nls, newdata = dist_newdata))
legend("topright", legend = c("nls fit"), lty = 1, col = c(1))

Adding your ‘guess’ model

Now let’s plot the model you fit visually. Refer back to lab 5 if you need a refresher.
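
If you need a starting point, here’s a sketch (a_guess and b_guess are hypothetical placeholders for the parameter values you chose in lab 5):

# Add the visually-fit curve to the existing scatterplot
# (a_guess and b_guess are placeholders for your own values):
curve(
  ricker_fun(x, a_guess, b_guess),
  add = TRUE, lty = 2, col = "red")

# A legend covering both lines (use this in place of the earlier one):
legend(
  "topright",
  legend = c("nls fit", "guess fit"),
  lty = c(1, 2), col = c(1, "red"))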

Logistic Regression

Let’s revisit those logistic models of bird presence/absence that we fit by hand way back in lab 3!

Hermit Warbler Presence/Absence

Review the lab 3 document if you need a refresher on the data set. For these examples, I’ll use the hermit warbler (species code HEWA).

Here’s how the bird presence/absence data look when plotted against elevation, slope, and total basal area.

For this exercise, you’ll need the bird.sta.csv and hab.sta.csv data files.

You’ll merge them to create the complete dataset (by default, merge() matches rows using the columns the two data.frames share):

require(here)

dat_bird = read.csv(here("data", "bird.sta.csv"))
dat_habitat = read.csv(here("data", "hab.sta.csv"))
dat_all = merge(dat_bird, dat_habitat)

Finally, you’ll need to make a vector of presence/absence data for hermit warblers:

dat_all$HEWA_pres = dat_all$HEWA > 0
  • Can you explain why this code works?

Terrain Plots

We can plot presence/absence against some potential predictors:

Binary outcome data

We know that our Group 1 models don’t work for binary response variables.

We’ll fit logistic models, which are a type of GLM that work with binary outcomes.

The syntax is very similar to what we used when fitting Group 1 models using lm().

Here’s how I could fit a series of models using total basal area and slope:

# Hermit warbler presence/absence
dat_all$HEWA_pres = dat_all$HEWA > 0

# Create model fits
fit_hewa_slope = glm(HEWA_pres ~ slope, data = dat_all, family = binomial)
fit_hewa_ba_tot = glm(HEWA_pres ~ ba.tot, data = dat_all, family = binomial)
fit_hewa_both_additive = glm(HEWA_pres ~ slope + ba.tot, data = dat_all, family = binomial)
fit_hewa_both_interactive = glm(HEWA_pres ~ slope * ba.tot, data = dat_all, family = binomial)

Notice the similarity to the lm() syntax.

Notice also the addition of the family = binomial argument. This tells glm() that you want to do a logistic regression.
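
For the curious: binomial models use the logit link function by default, so spelling out the link fits an identical model:

# Identical to fit_hewa_slope above; the logit link is binomial's default:
fit_hewa_slope_logit = glm(
  HEWA_pres ~ slope,
  data = dat_all,
  family = binomial(link = "logit"))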

Let’s look at the summary for the additive model:

summary(fit_hewa_both_additive)
## 
## Call:
## glm(formula = HEWA_pres ~ slope + ba.tot, family = binomial, 
##     data = dat_all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0460  -1.0543  -0.8341   1.1584   1.6802  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.183085   0.172553  -1.061   0.2887    
## slope       -0.009109   0.002962  -3.075   0.0021 ** 
## ba.tot       0.026479   0.003576   7.405 1.31e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1449.2  on 1045  degrees of freedom
## Residual deviance: 1379.2  on 1043  degrees of freedom
## AIC: 1385.2
## 
## Number of Fisher Scoring iterations: 4
  • Does either predictor have a significant slope coefficient?

Plotting a fitted simple logistic model

Unfortunately, plotting the best-fit curve for a logistic model isn’t as simple as plotting the best-fit line for a linear model. Recall that we could just use abline() in the linear case!

We can use the predict() function to predict response values for given sets of predictor values. But… with a logistic model we have to decide whether we want to predict the outcomes (presence/absence) or the probability of success. We’ll go with the probability of success, since it’s what we are really modeling.
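
To see the difference between the two scales, here’s a quick sketch using fit_hewa_slope from above: the default predictions are log-odds, and applying the inverse logit recovers the probabilities that type = "response" returns.

# Predictions on the link (log-odds) scale vs. the probability scale:
pred_link = predict(fit_hewa_slope, type = "link")
pred_prob = predict(fit_hewa_slope, type = "response")

# The inverse logit of the log-odds matches the probabilities:
all.equal(as.numeric(pred_prob), 1 / (1 + exp(-as.numeric(pred_link))))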

For the slope, I can make a data.frame object that contains a sequence of new slopes. I’ll create a sequence of 500 slopes so that I get a nice smooth curve.

n = 500

slope_newdata = data.frame(
  slope = seq(
    from = min(dat_all$slope, na.rm = T),
    to = max(dat_all$slope, na.rm = T),
    length.out = n
  )
)

Things to notice:

  • I used the minimum and maximum observed values of slope in the original data.
  • I used na.rm to avoid issues with missing data.

Now you should make a new data.frame of total basal area values for prediction. Call it ba_newdata.
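
If you get stuck, here’s one way to build it, mirroring slope_newdata above:

# A sketch: a sequence of total basal area values spanning the observed range
ba_newdata = data.frame(
  ba.tot = seq(
    from = min(dat_all$ba.tot, na.rm = T),
    to = max(dat_all$ba.tot, na.rm = T),
    length.out = n))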

Predicted Data

Now we’re ready to create our model predictions.

We can use the predict() function along with the newdata argument:

slope_newdata$hewa_predicted = 
  predict(
    fit_hewa_slope,
    newdata = slope_newdata,
    type = "response"
  )

ba_newdata$hewa_predicted = 
  predict(
    fit_hewa_ba_tot,
    newdata = ba_newdata,
    type = "response"
  )

Things to note:

  • I added the predicted values as new columns to the data.frame objects.
  • You need to use the type = "response" argument so that the predicted values are Pr(present), i.e. on the probability scale.

Plotting the models 1

I can make basic plots of the single-factor models:

par(mfrow = c(2, 1))

# Presence/absence data, translucent points:
plot(
  HEWA_pres ~ slope, data = dat_all,
  xlab = "Percent Slope",
  ylab = "HEWA presence/absence",
  pch = 16, cex = 1.5, col = gray(0, 0.2)
)

lines(hewa_predicted ~ slope, data = slope_newdata)

plot(
  HEWA_pres ~ ba.tot, data = dat_all,
  xlab = "Basal Area",
  ylab = "HEWA presence/absence",
  pch = 16, cex = 1.5, col = gray(0, 0.2)
)

lines(hewa_predicted ~ ba.tot, data = ba_newdata)

Plotting the models 2

Here are some plots with nicer formatting:

Plotting Both Predictors

We can compare models using AIC:

AIC(
  fit_hewa_ba_tot,
  fit_hewa_slope,
  fit_hewa_both_additive,
  fit_hewa_both_interactive)
##                           df      AIC
## fit_hewa_ba_tot            2 1392.759
## fit_hewa_slope             2 1446.574
## fit_hewa_both_additive     3 1385.185
## fit_hewa_both_interactive  4 1383.233

It looks like the additive and interactive models have the lowest AIC values, but how can we plot the response against two predictors at once?

We could use our 3D plotting skills (from last lab) to make an interactive perspective plot!

Data Setup

First, we need to set up the data on which to make predictions. We’ll need a data.frame that includes columns for both predictors: total basal area and slope.

Let’s create sequences of values for both predictors:

n = 50

ba.tot = seq(
  from = min(dat_all$ba.tot, na.rm = T),
  to = max(dat_all$ba.tot, na.rm = T),
  length.out = n)
slope = seq(
  from = min(dat_all$slope, na.rm = T),
  to = max(dat_all$slope, na.rm = T),
  length.out = n)

I can take advantage of the expand.grid() function to make a data.frame that contains every combination of the two predictors:

new_dat_all = expand.grid(
  ba.tot = ba.tot,
  slope = slope)
head(new_dat_all)
##     ba.tot slope
## 1  0.00000     0
## 2  4.22449     0
## 3  8.44898     0
## 4 12.67347     0
## 5 16.89796     0
## 6 21.12245     0
tail(new_dat_all)
##        ba.tot slope
## 2495 185.8776   110
## 2496 190.1020   110
## 2497 194.3265   110
## 2498 198.5510   110
## 2499 202.7755   110
## 2500 207.0000   110

Now, I can use predict() to add vectors of predictions for both the additive and interactive models:

new_dat_all$pred_add = predict(
  fit_hewa_both_additive,
  newdata = new_dat_all,
  type = "response")
new_dat_all$pred_int = predict(
  fit_hewa_both_interactive,
  newdata = new_dat_all,
  type = "response")

The contour and 3D plotting methods require matrices for the data on the z-axis:

# expand.grid() varies its first argument (ba.tot) fastest, and matrix()
# fills column-by-column when byrow = FALSE, so rows of z correspond to
# basal area values and columns to slope values:
z_hewa_add = matrix(
  new_dat_all$pred_add,
  nrow = length(ba.tot),
  byrow = FALSE)
z_hewa_int = matrix(
  new_dat_all$pred_int,
  nrow = length(ba.tot),
  byrow = FALSE)

Plot in 3D

Finally, I’m ready to make my interactive 3D plot:

require(rgl)

rgl::persp3d(
  x = ba.tot,
  y = slope,
  z = z_hewa_add,
  col = "steelblue",
  xlab = "Basal Area",
  ylab = "Slope",
  zlab = "Pr(present)",
  alpha = 0.4)
rglwidget()

And the interactive model (you can figure out the code):

Do you notice much difference?

Contour Plot

It might be easier to tell with a contour plot:

par(mfrow = c(1, 2))
contour(
  x = ba.tot, y = slope,
  z = z_hewa_add,
  xlab = "Total Basal Area",
  ylab = "Percent Slope",
  main = "Additive")
contour(
  x = ba.tot,
  y = slope,
  z = z_hewa_int,
  xlab = "Total Basal Area",
  ylab = "Percent Slope",
  main = "Interactive")

Even fancier 3D plots (optional)

We could make our 3D plots even fancier by color-coding the surface based on the z-value.

Note: this code is a little clunky; don’t worry if it doesn’t make sense. It’s totally optional, but the colors make the difference between the interactive and additive models easier to see.

# Create a color ramp function using heat colors
colorfunc = colorRampPalette(
  heat.colors(length(ba.tot)))

# Figure out the indices of the color ramp
col_indices_hewa_add = findInterval(
  new_dat_all$pred_add,
  seq(
    min(new_dat_all$pred_add),
    max(new_dat_all$pred_add),
    length.out = 50))

rgl::persp3d(
  x = ba.tot,
  y = slope,
  z = z_hewa_add,
  xlab = "Basal Area",
  ylab = "Slope",
  zlab = "Pr(present)",
  alpha = 0.9,
  col = colorfunc(50)[col_indices_hewa_add]
)
rglwidget()

And the interactive model:

Lab Questions

Q1: LOWESS

Instructions

  1. Review the population dispersion and statistical power simulation you created for Lab 11.
  2. Re-run the simulation with a small number of replicates, say 30, so that you get a jagged plot.
  3. Fit a LOWESS model using 30% of the data.
  4. Create a plot like the one I showed above of your simulated data and your fitted LOWESS model. Your plot must show:
    • The smooth LOWESS line
    • Points representing the simulation data.
    • An appropriate legend, title, and axis labels
  • Q1 (2 pts.): Include your plot in your lab report.

Q2: NLS

Instructions

  1. Follow the lab walkthrough and make sure you can recreate the plots.
  2. Review the exponential model you fit in lab 5.
  3. Fit an NLS exponential model to the marbled salamander data.
  4. Create a plot of your NLS and visually-fitted exponential models. Your plot must include:
    • A line for your visually-estimated model.
    • A line for your model fit via NLS.
    • An appropriate legend, title, and axis labels.
  • Q2 (2 pts.): Include your plot in your lab report.

Q3-5: Logistic Models 1: Model Fits

Instructions

  • Fit four logistic models of Golden-crowned Kinglet (species code GCKI) presence/absence as predicted by:
    • slope
    • total basal area
    • slope and basal area (additive)
    • slope and basal area (interactive)

Q3 (1 pt.): What are the AIC values for each of the 4 models?

Q4 (1 pt.): Which model would you choose, and why?

Q5 (1 pt.): Based on the model coefficient table of your chosen model, describe the direction and significance of the relationship(s) of the predictor variable or variables to the binary response. Make sure your answer is in terms of the ecological context.

Q6-7: Logistic Models 2: Graphical Exploration

Predicted Values

Review the material on predicting response variable values from logistic models.

  • For your two single-predictor models for the Golden-crowned Kinglet, predict values of the response variable.
    • You should use the slope_newdata and ba_newdata objects that you created in the walkthrough.

Simple Model Plots

Create plots of presence/absence and model curves for both single-predictor models.

My basic plots looked like this:
Optional challenge: recreate my fancy plots.

2-Predictor Model Plots

Review the material in the walkthrough about making 3D perspective and 2D contour plots of model predictions.

Create contour plots (or optionally 3D perspective plots) like those in the walkthrough, but for your models of the Golden-crowned Kinglet.

  • If you decide to make interactive 3D perspective plots and include them in an RMarkdown document, make sure you include the call to rglwidget() in your code chunk.
  • If you want to make interactive 3D perspective plots, but don’t want to do your work in an RMarkdown document, you may submit them as individual html files using the saveWidget() function (see lab 11 for more info).
  • Q6 (2 pts.): Include your two single-predictor model plots in your report.
  • Q7 (4 pts.): Include contour plots (or interactive 3D perspective plots) in your report.