This lab covers three techniques that we’ve mentioned briefly in the lecture: Additive models, nonlinear least squares, and logistic regression.
In order to complete this lab, you’ll need some of the material you’ve used to answer questions in previous labs, most notably labs 3, 5, and 11.
Recall our power analysis of sample size for the Brown Creeper and Late-Successional Forest cover from lab 11.
For this section of the walkthrough, I’ll be using data from a sample size simulation. You can review the “Simulating Sample Sizes” section of lab 11 for the simulation code. My simulation used the following characteristics:
To follow along with my code, you’ll need to conduct a similar simulation and save the simulation output to a data.frame called “sim_sample_size”.
Here’s a plot of a realization of the simulation with 30 replicates at each sample size:
The plot looks pretty jagged; we can use a smoothing procedure to make the plot less rough. Take a look at an example before we dive into details:
Locally Weighted Scatterplot Smoothing (LOWESS, sometimes called LOESS) is a commonly-used smoothing procedure.
LOWESS works by fitting ‘local weighted regressions’ at each value of the predictors. Observations with predictor values close to the focal value are given more ‘weight’ in the model-fitting procedure.
The proportion of data points used in the local regressions affects the shape of the curve. Using a greater proportion of the data makes the curve smoother: it “smooths out” local detail. Using a smaller proportion of the data for each local regression preserves local detail.
Take a look at LOESS applied to the same data with different amounts of smoothing:
The syntax for fitting a LOWESS model is very similar to what we’ve seen with lm().
Here I’ve fit a model using 50% of the data (the span argument).
fit_lowess_50 = loess(power ~ sample_size, data = sim_sample_size, span = 0.5)
I can use predict() to create a data.frame of new values for plotting my smoother:
newdata_sample_size = data.frame(sample_size = seq(2, 20, length.out = 100))
Finally, I can plot it:
plot(
x = newdata_sample_size$sample_size,
y = predict(fit_lowess_50, newdata = newdata_sample_size),
type = "l",
ylab = "Statistical Power", xlab = "Sample Size")
If I wanted to be fancy, I could plot the non-smoothed simulation results in a different color and include a legend:
You now know enough R to make the fancy version of the plot yourself!
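If you get stuck, here’s one possible version, as a sketch (it assumes the objects created above; colors and legend placement are just suggestions):

# Plot the smoothed curve first:
plot(
  x = newdata_sample_size$sample_size,
  y = predict(fit_lowess_50, newdata = newdata_sample_size),
  type = "l",
  ylab = "Statistical Power", xlab = "Sample Size")
# Overlay the non-smoothed simulation results in a different color:
lines(power ~ sample_size, data = sim_sample_size, col = "red", lty = 2)
legend(
  "bottomright",
  legend = c("smoothed", "original"),
  lty = c(1, 2), col = c("black", "red"))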
Do you remember the Ricker function?
And the Ricker function models we fit by hand in lab 5?
We can use Nonlinear Least Squares to fit an arbitrary function to data using a least squares criterion. I’ll present an example.
Background on the example
I performed the following prior to running the example code I show below. You should recreate these steps prior to following along. I’ll leave it up to you to review materials from lab 5 if you need a refresher on how to do these steps.
- I read the dispersal data into a data.frame called dat_dispersal using read.csv() and here().
- I loaded the Ricker function ricker_fun() into my global environment, and I used my guesses for the two parameters along with curve() to plot my guess fit.

We’ll be using the nls() function to fit our model.
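In case you need a refresher, here’s a sketch of a common parameterization of the Ricker function; double-check it against the version you built in lab 5.

# A common parameterization of the Ricker function:
# f(x) = a * x * exp(-b * x)
# Check this against your lab 5 code.
ricker_fun = function(x, a, b)
{
  return(a * x * exp(-b * x))
}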
Recall that nonlinear least squares attempts to find the parameter values of a function that minimize the sum of squared errors (i.e. the sum of squared residuals). Unlike linear regression, the function doesn’t have to be linear in the parameters. The Ricker function is definitely not linear in the parameters!
Here’s the syntax to fit a Ricker model:
fit_ricker_nls = nls(
disp.rate.ftb ~ ricker_fun(dist.class, a, b),
data = dat_dispersal,
start = list(b = 0, a = 1))
summary(fit_ricker_nls)
##
## Formula: disp.rate.ftb ~ ricker_fun(dist.class, a, b)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## b 0.003586 0.000913 3.928 0.00201 **
## a 0.004471 0.001889 2.367 0.03558 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2149 on 12 degrees of freedom
##
## Number of iterations to convergence: 17
## Achieved convergence tolerance: 9.07e-06
## (1 observation deleted due to missingness)
The syntax shares some similarities with that of lm(), but notice that the model formula uses our Ricker function. Notice also that there is a start argument. We use start to provide nls() with starting parameter values from which to search for the optimal values.
In the interest of not re-inventing wheels, I’ll refer you to a good blog post on using nls() here. It’s got a good explanation of the syntax, as well as a nice example.
NLS fitting uses iterative methods to find the optimal parameter values. The algorithm uses the current parameter values to make guesses for the next iteration. When subsequent iterations do not result in changed parameter values (within a tolerance), the algorithm terminates.
The output of the model summary tells us how many iterations were needed to find the optimal parameter values.
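To build some intuition for the criterion that nls() minimizes, here’s a hedged sketch that computes the Ricker model’s sum of squared residuals and minimizes it with R’s general-purpose optimizer, optim(). This is not what nls() does internally (nls() uses algorithms specialized for least squares), but the idea is similar. It assumes the dat_dispersal and ricker_fun() objects from above.

# Sum of squared residuals for the Ricker model:
ricker_sse = function(par)
{
  pred = ricker_fun(dat_dispersal$dist.class, par[1], par[2])
  sum((dat_dispersal$disp.rate.ftb - pred)^2, na.rm = TRUE)
}

# Minimize the SSE, starting from the same guesses we gave nls():
optim(par = c(a = 1, b = 0), fn = ricker_sse)$par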
NLS can be very sensitive to your initial guesses for parameter values. Counterintuitively, if your initial guesses are very close to the true optima, the algorithm can fail and return an error. Note that the initial values I provided (0 and 1) are very different from the true optima.
Start by plotting your data points. I’ll let you create your own scatterplot; it should look something like this:
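If you’d like a starting point, here’s a minimal sketch (axis labels and point styling are just suggestions):

plot(
  disp.rate.ftb ~ dist.class,
  data = dat_dispersal,
  xlab = "Distance Class",
  ylab = "Dispersal Rate",
  pch = 16, col = gray(0, 0.5))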
We’ll use the predict() function with our fitted nls model. The predict() function requires a data.frame that contains values of the predictor variables. It is a little bit picky, though: the columns of the data frame have to have exactly the same names as the predictor variables used to build the model. The first step is to make such a data frame:
dist_newdata = data.frame(dist.class = seq(0, 1600, length.out = 1600))
Now we can plot the results of our model! The lines() function will add to an existing plot:
lines(x = dist_newdata$dist.class, y = predict(fit_ricker_nls, newdata = dist_newdata))
legend("topright", legend = c("nls fit"), lty = 1, col = c(1))
Now let’s add the model you fit by eye. Refer back to lab 5 if you need a refresher.
Let’s revisit those logistic models of bird presence/absence that we fit by hand way back in lab 3!
Review the lab 3 document if you need a refresher on the data set. For these examples, I’ll use hermit warbler.
Here’s how the bird presence/absence data look when plotted against elevation, slope, and total basal area.
For this exercise, you’ll need the bird.sta.csv and hab.sta.csv data files. You’ll merge them to create the complete dataset:
require(here)
dat_bird = read.csv(here("data", "bird.sta.csv"))
dat_habitat = read.csv(here("data", "hab.sta.csv"))
dat_all = merge(dat_bird, dat_habitat)
Finally, you’ll need to make a vector of presence/absence data for hermit warblers:
dat_all$HEWA_pres = dat_all$HEWA > 0
We can plot presence/absence against some potential predictors:
We know that our Group 1 models don’t work for binary response variables.
We’ll fit logistic models, which are a type of GLM that work with binary outcomes.
The syntax is very similar to what we used when fitting Group 1 models using lm().
Here’s how I could fit a series of models using total basal area and slope:
# Hermit warbler presence/absence
dat_all$HEWA_pres = dat_all$HEWA > 0
# Create model fits
fit_hewa_slope = glm(HEWA_pres ~ slope, data = dat_all, family = binomial)
fit_hewa_ba_tot = glm(HEWA_pres ~ ba.tot, data = dat_all, family = binomial)
fit_hewa_both_additive = glm(HEWA_pres ~ slope + ba.tot, data = dat_all, family = binomial)
fit_hewa_both_interactive = glm(HEWA_pres ~ slope * ba.tot, data = dat_all, family = binomial)
Notice the similarity to the lm() syntax. Notice also the addition of the family = binomial argument. This tells glm() that you want to do a logistic regression.
Let’s look at the summary for the additive model:
summary(fit_hewa_both_additive)
##
## Call:
## glm(formula = HEWA_pres ~ slope + ba.tot, family = binomial,
## data = dat_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0460 -1.0543 -0.8341 1.1584 1.6802
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.183085 0.172553 -1.061 0.2887
## slope -0.009109 0.002962 -3.075 0.0021 **
## ba.tot 0.026479 0.003576 7.405 1.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1449.2 on 1045 degrees of freedom
## Residual deviance: 1379.2 on 1043 degrees of freedom
## AIC: 1385.2
##
## Number of Fisher Scoring iterations: 4
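A quick interpretation tip: the coefficients in the table are on the log-odds (logit) scale. As a sketch, you can exponentiate them to get odds ratios, which can be easier to describe:

# Coefficients are on the log-odds scale;
# exp() converts them to odds ratios:
exp(coef(fit_hewa_both_additive))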
Unfortunately, plotting the best-fit curve for a logistic model isn’t as simple as plotting the best-fit line for a linear model. Recall that we could just use abline() in the linear case!
We can use the predict() function to predict response values for given sets of predictor values. But with a logistic model we have to decide whether we want to predict the outcomes (presence/absence) or the probability of success. We’ll go with the probability of success, since it’s what we are really modeling.
For the slope, I can make a data.frame object that contains a sequence of new slopes. I’ll create a sequence of 500 slopes so that I get a nice smooth curve.
n = 500
slope_newdata = data.frame(
slope = seq(
from = min(dat_all$slope, na.rm = T),
to = max(dat_all$slope, na.rm = T),
length.out = n
)
)
Things to notice:

- I used the na.rm argument to avoid issues with missing data.

Now you should make a new data.frame of total basal area values for prediction. Call it ba_newdata.
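As a sketch, mirroring the slope version above:

ba_newdata = data.frame(
  ba.tot = seq(
    from = min(dat_all$ba.tot, na.rm = T),
    to = max(dat_all$ba.tot, na.rm = T),
    length.out = n))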
Now we’re ready to create our model predictions. We can use the predict() function along with the newdata argument:
slope_newdata$hewa_predicted =
predict(
fit_hewa_slope,
newdata = slope_newdata,
type = "response"
)
ba_newdata$hewa_predicted =
predict(
fit_hewa_ba_tot,
newdata = ba_newdata,
type = "response"
)
Things to note:

- I saved the predicted values into columns of the corresponding data.frame objects.
- I used the type = "response" argument so that the predicted values are Pr(present), i.e. that the predicted values are on the scale of probability.

I can make basic plots of the single-factor models:
par(mfrow = c(2, 1))
# Presence/absence data, translucent points:
plot(
HEWA_pres ~ slope, data = dat_all,
xlab = "Percent Slope",
ylab = "HEWA presence/absence",
pch = 16, cex = 1.5, col = gray(0, 0.2)
)
lines(hewa_predicted ~ slope, data = slope_newdata)
plot(
HEWA_pres ~ ba.tot, data = dat_all,
xlab = "Basal Area",
ylab = "HEWA presence/absence",
pch = 16, cex = 1.5, col = gray(0, 0.2)
)
lines(hewa_predicted ~ ba.tot, data = ba_newdata)
Here are some plots with nicer formatting:
We can compare models using AIC:
AIC(
fit_hewa_ba_tot,
fit_hewa_slope,
fit_hewa_both_additive,
fit_hewa_both_interactive)
## df AIC
## fit_hewa_ba_tot 2 1392.759
## fit_hewa_slope 2 1446.574
## fit_hewa_both_additive 3 1385.185
## fit_hewa_both_interactive 4 1383.233
It looks like the additive and interactive models have the lowest AIC values, but how can we plot the response with two predictors?
We could use our 3D plotting skills (from last lab) to make an interactive perspective plot!
First, we need to set up the data on which to make predictions. We’ll need a data.frame that includes columns for both predictors: total basal area and slope. Let’s create sequences of values for both predictors:
n = 50
ba.tot = seq(
from = min(dat_all$ba.tot, na.rm = T),
to = max(dat_all$ba.tot, na.rm = T),
length.out = n)
slope = seq(
from = min(dat_all$slope, na.rm = T),
to = max(dat_all$slope, na.rm = T),
length.out = n)
I can take advantage of the expand.grid() function to make a data.frame that contains every combination of the two predictors:
new_dat_all = expand.grid(
ba.tot = ba.tot,
slope = slope)
head(new_dat_all)
## ba.tot slope
## 1 0.00000 0
## 2 4.22449 0
## 3 8.44898 0
## 4 12.67347 0
## 5 16.89796 0
## 6 21.12245 0
tail(new_dat_all)
## ba.tot slope
## 2495 185.8776 110
## 2496 190.1020 110
## 2497 194.3265 110
## 2498 198.5510 110
## 2499 202.7755 110
## 2500 207.0000 110
Now, I can use predict() to make a vector of model predictions:
new_dat_all$pred_add = predict(
fit_hewa_both_additive,
newdata = new_dat_all,
type = "response")
The contour and 3D plotting methods require matrices for the data on the z-axis:
z_hewa_add = matrix(
new_dat_all$pred_add,
nrow = length(ba.tot),
byrow = FALSE)
z_hewa_int = matrix(
new_dat_all$pred_int,
nrow = length(ba.tot),
byrow = FALSE)
Finally, I’m ready to make my interactive 3D plot:
require(rgl)
rgl::persp3d(
x = ba.tot,
y = slope,
z = z_hewa_add,
col = "steelblue",
xlab = "Basal Area",
ylab = "Slope",
zlab = "Pr(present)",
alpha = 0.4)
rglwidget()
And the interactive model (you can figure out the code):
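If you get stuck, it’s the same persp3d() call as above, but using the interactive model’s predictions:

rgl::persp3d(
  x = ba.tot,
  y = slope,
  z = z_hewa_int,
  col = "steelblue",
  xlab = "Basal Area",
  ylab = "Slope",
  zlab = "Pr(present)",
  alpha = 0.4)
rglwidget()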
Do you notice much difference?
It might be easier to tell with a contour plot:
par(mfrow = c(1, 2))
contour(
x = ba.tot, y = slope,
z = z_hewa_add,
xlab = "Total Basal Area",
ylab = "Percent Slope",
main = "Additive")
contour(
x = ba.tot,
y = slope,
z = z_hewa_int,
xlab = "Total Basal Area",
ylab = "Percent Slope",
main = "Interactive")
We could make our 3D plots even fancier by color-coding the surface based on the z-value.
Note: this code is a little clunky, so don’t worry if it doesn’t make sense. It’s totally optional, but the colors make the difference between the interactive and additive models easier to see.
# Create a color ramp function using heat colors
colorfunc = colorRampPalette(
heat.colors(length(ba.tot)))
# Figure out the indices of the color ramp
col_indices_hewa_add = findInterval(
new_dat_all$pred_add,
seq(
min(new_dat_all$pred_add),
max(new_dat_all$pred_add),
length.out = 50))
rgl::persp3d(
x = ba.tot,
y = slope,
z = z_hewa_add,
xlab = "Basal Area",
ylab = "Slope",
zlab = "Pr(present)",
alpha = 0.9,
col = colorfunc(50)[col_indices_hewa_add]
)
rglwidget()
And the interactive model:
Instructions
Q3 (1 pt.): What are the AIC values for each of the 4 models?
Q4 (1 pt.): Which model would you choose, and why?
Q5 (1 pt.): Based on the model coefficient table of your chosen model, describe the direction and significance of the relationship(s) of the predictor variable or variables to the binary response. Make sure your answer is in terms of the ecological context.
Predicted Values
Review the material on predicting response variable values from logistic models. You may use the slope_newdata and ba_newdata objects that you created in the walkthrough.
Simple Model Plots
Create plots of presence/absence and model curves for both single-predictor models.
2-Predictor Model Plots
Review the material in the walkthrough about making 3D perspective and 2D contour plots of model predictions.
Create contour plots (or optionally 3D perspective plots) like those in the walkthrough, but for your models of golden-crowned kinglet.
- To display an interactive 3D plot in your report, include a call to rglwidget() in your code chunk.
- Alternatively, you can save your plot to a standalone HTML file with the saveWidget() function (see lab 11 for more info).
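As a minimal sketch of the save-to-file option (the filename is just an example; saveWidget() comes from the htmlwidgets package):

require(rgl)
require(htmlwidgets)
# ... create your persp3d() plot here ...
saveWidget(rglwidget(), file = "kinglet_3d_plot.html")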