This lab is an extension of the individual assignment ‘Data Exploration and Deterministic Functions’ from the lecture course.
I suggest you work through the lecture assignment instructions and questions before proceeding.
The psych
package has a function,
pairs.panels()
which produces very nice looking
pairplots.
Install psych
and try it out with the iris data set!
psych
does not come with R so you must install it:install.packages("psych")
install.packages("psych")
again!Once you’ve installed it, use require()
to load the
package:
require(psych)
## Loading required package: psych
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(iris)
pairs.panels()
function is part of the
psych
package.Suppose I only wanted to plot a subset of the columns in the iris dataset?
I could use subsetting by name, along with the square brackets, to pull out three columns of interest:
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
pairs.panels(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
here
package.Loading data files into R can be a challenge. Fortunately there are some tools to make the task easier. In this course we’ll be using RProjects in conjunction with the here package.
What’s the here package?
You can read a tutorial I made about the here package at this link.
I also created a YouTube tutorial, in case you like that sort of thing.
Take a moment to read about the package and install it (remember how to do this?).
You’ll need the bird census data, in addition to the habitat data you
use for the lecture questions, for these lab questions. Make sure these
data files are saved in the data
subdirectory of your
course RProject.
Here is the metadata, which describes the data and decodes the names of the columns:
Use here()
and read.csv()
to read
bird.sta.csv
into a data.frame
called
dat_bird
and hab.sta.csv
into a
data.frame
called dat_habitat
.
Hint: I saved the two data files to a subdirectory called
data
in my RProject. I used this syntax to create
dat_bird
:
require(here)
dat_bird = read.csv(
here("data", "bird.sta.csv")
)
head(dat_bird)
## basin sub sta AMCR AMDI AMGO AMRO BCCH BEWR BGWA BHCO BHGR BRCR BTPI BUSH CBCH CEWA CORA COYE DEJU DOWO EUST
## 1 D AL 1 0 0 0 1 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0
## 2 D AL 2 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0
## 3 D AL 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 D AL 4 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 5 D AL 5 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 6 D AL 6 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
## EVGR GCKI GCSP GRJA HAFL HAWO HETH HEWA HOME HOWR HUVI MALL MAMU MGWA MOQU NOFL OCWA OSFL PISI PIWO PSFL PUFI
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## 2 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0
## 3 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 2 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## PYOW RBNU RBSA RCKI RECR RSTO RTHA RUGR RUHU SCOW SOSP SOVI SPOW SSHA STJA SWTH TOSO TOWA TRSW UNBI UNFL UNTH
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0
## 6 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
## UNWA UNWO VATH VGSW WAVI WCSP WEBL WETA WIFL WIWA WIWR WODU WREN WWPE YRWA b.total b.rich b.sidi
## 1 0 0 0 0 1 0 0 2 0 0 1 0 0 0 0 13 9 0.12426036
## 2 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 11 8 0.14049587
## 3 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 8 7 0.15625000
## 4 0 0 1 0 0 0 0 0 0 2 4 0 0 0 0 10 6 0.24000000
## 5 0 0 0 0 0 0 0 0 0 2 1 0 1 0 0 14 10 0.09693878
## 6 0 0 0 0 2 0 0 0 0 2 1 0 0 0 0 10 8 0.14000000
A few things to notice:
here()
to locate a file within a
directory of my main RProject directory.head()
to preview the first 6 lines of
the data. Also note that there are so many columns that it is a little
hard to read!The two data files contain information about sampling locations and we’d like to be able to analyze both the bird count data and the habitat data.
We can combine both data sets into a new data frame, but we need to be careful:
dat_habitat
and dat_birds
?merge()
functionThe two data sets have several columns in common (which ones? that
will allow us to combine them and ensure that each row in
dat_birds
is associated with the correct row in
dat_habitat
.
Build a new data.frame called dat_all
using the
merge()
function with dat_birds
and
dat_habitat
.
You can test that your merge operation was successful by recreating this scatterplot with the code given below:
plot(ba.tot ~ elev, data = dat_all)
You may have noticed that there a lot of zeroes in the bird census data.
For sample, here are counts of Cedar Waxwings at 100 randomly sampled sites:
sample(dat_all$CEWA, 100)
## [1] 0 0 0 1 0 0 3 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [54] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0
Note: I had to run that code multiple times on my computer before I saw any non-zero entries! You should try running it on your machine several times on your machine to get an idea of how few sampling sites had cedar waxwings.
In fact, a total of only 116 cedar waxwings were observed at 53 (out of 1046) of sites.
Challenge: How could I use R to calculate the total number of waxwings and the number of sites in which they were present?
Since there are so many zeroes, it may be useful to think about bird presence/absence instead of counts.
You’ve created Boolean vectors using the equality test operator
==
in previous assignments:
my_vec = rep(1:3, 5)
my_vec == 3
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
You could also use the greater than test to select the elements that are greater than 1:
my_vec > 1
## [1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
Use this as a template to write R code to build Boolean vector from the column of count data for cedar waxwings.
Your Boolean vector should have TRUE
values for each
sampling location where at least one bird was observed, and
FALSE
for each sampling location where the bird was
absent.
Here’s a cool trick to coerce a Boolean vector into a vector of ones and zeroes:
as.numeric(my_vec > 1)
## [1] 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1
Voilà!
Use the same strategy to convert your Boolean presence/absence vector
for cedar waxwings into a numeric vector. Let’s call it
cewa_present_absent
.
To test whether you have successfully built
cewa_present_absent
, run the following code to re-create
the plot below:
plot(x = dat_all$elev, y = cewa_present_absent)
How would you interpret the plot? Is elevation a good predictor of whether cedar waxwings are present?
NOTE: our visual intuition is a powerful tool, but it’s not foolproof. We can miss patterns, and see patterns that aren’t really there! Think about how much an electrical outlet looks like a face.
Review the description of the logistic function in McGarigal chapter 4. Pay special attention to the figures on pages 7 and 17 that shows how the parameters a and b relate to the shape and position of the curve.
Also note that the use of the parameters on the two figures is not consistent.
Many commonly used functions in ecology and other fields can be parameterized in two or more ways.
The choice which parameterization to use may depend on things such as:
Key things to notice about the logistic curve:
Recall the functions for visually parameterizing a linear function from the in-class coding activity?
Here are a set of functions that you can use to do the same with a logistic curve:
You can find the equations and figures I used to build these parameterization functions in the McGarigal Deterministic Functions chapter (page 17).
NOTE: For now, don’t worry about the details of how the code below works, you can think of it as a black box for the moment. We’ll talk much more about building custom functions later on.
get_logistic_param_a()
,
get_logistic_param_b
, logistic()
, and
logistic_midpoint_slope()
will be available for you to
use.# Function to calculate the logistic parameter a given the slope and midpoint
get_logistic_param_a = function(slope, midpoint)
{
b = slope / 4
return (-midpoint * (slope / 4))
}
# Function to calculate the logistic parameter b given the slope
get_logistic_param_b = function(slope)
{
return (slope / 4)
}
# Calculate the value of the logistic function at x, given the parameters a and b.
logistic = function(x, a, b)
{
val = exp(a + b * x)
return(val / (1 + val))
}
# Calculate the value of the logistic function at x, given a slope and midpoint.
logistic_midpoint_slope = function(x, midpoint, slope)
{
b = get_logistic_param_b(slope)
a = get_logistic_param_a(slope, midpoint)
return(logistic(x, a, b))
}
Let’s try to plot a logistic curve. Let’s use a midpoint of 400. We can try a slope of 0.1 to start with.
plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = 0.1), add = TRUE)
Let’s try a negative slope:
plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = -0.1), add = TRUE)
Let’s try a shallower negative slope:
plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = -0.005), add = TRUE)
Does elevation look like a good predictor of cedar waxwing presence/absence?
Does the logistic function look like a good model for cedar waxwing presence/absence?
Perhaps a different variable would better explain the bird presence/absence.
Some reflections on the plots:
A couple of arguments to plot()
and curve()
might be of interest:
cex
: this controls the size of the plotting character.
Default is 1.pch
: this controls the shape of the plotting character.
Default is 1. Code 16 plots a solid point.col
: this controls the color of the plotting character.
Default is 1, corresponding to black.A cool function that I like to use is adjustcolor()
. It
allows you to make points semi-transparent. For example, I used the
col
argument and adjustcolor()
inside a call
to plot()
to create the some modifications to one of my
plots above:
psych
to create a pair
plot of the three terrain variables (slope, aspect, elevation)
and basal area.Consider the first bird species you chose to examine in the walkthrough.
Q3 (1 pt.): Include a figure of your logistic function plot. Your figure must include the name of the bird species, appropriate title, axes, etc.
Q4 (3 pts.): Qualitatively describe the bird’s presence/absence patterns in terms of basal area (or your other chosen predictor). Your answer should make reference to your fitted logistic model plot. Some questions you might consider are:
Consider the second bird species you chose to examine in the walkthrough.
Q5 (1 pt.): Include a figure of your logistic function plot. Your figure must include the name of the bird species, appropriate title, axes, etc.
Q6 (3 pts.): Qualitatively describe the bird’s presence/absence patterns in terms of basal area (or your other chosen predictor). Your answer should make reference to your fitted logistic model plot. Some questions you might consider are:
Q8 (2 pts.): Show the R code you used to perform the calculation.
Q9 (1 pt.): Calculate the total number of sampling sites in which Gray Jays were observed.
Hint: What happens when you use the sum()
function on a
vector of Boolean values?
NOTE: To grade your R code, we will attempt to run it with only the
dat_all
data frame in the R environment. You should test
that your code does not depend upon any other variables.
Compile your answers to all 0 questions into a pdf document and submit via Moodle.