Learning objectives, technical sills and concepts:

  • Data exploration
  • Deterministic function: the logistic curve
  • Pretty pair plots

Introduction to Lab 3

This lab is an extension of the individual assignment ‘Data Exploration and Deterministic Functions’ from the lecture course.

I suggest you work through the lecture assignment instructions and questions before proceeding.

Nicer pairplots

The psych package has a function, pairs.panels() which produces very nice looking pairplots.

Install psych and try it out with the iris data set!

  • psych does not come with R so you must install it:
install.packages("psych")
  • You only need to install the package once. After you’ve successfully installed it, you should not run install.packages("psych") again!
  • We’ll also go over installing packages in the lecture.

Once you’ve installed it, use require() to load the package:

require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(iris)

  • The pairs.panels() function is part of the psych package.

Subsetting the Iris data

Suppose I only wanted to plot a subset of the columns in the iris dataset?

I could use subsetting by name, along with the square brackets, to pull out three columns of interest:

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
pairs.panels(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])

The here package.

Loading data files into R can be a challenge. Fortunately there are some tools to make the task easier. In this course we’ll be using RProjects in conjunction with the here package.

What’s the here package?

  • Its a utility package that works in conjunction with RRpojects to make finding your data files much easier.

You can read a tutorial I made about the here package at this link.

I also created a YouTube tutorial, in case you like that sort of thing.

Take a moment to read about the package and install it (remember how to do this?).

Lab Data

You’ll need the bird census data, in addition to the habitat data you use for the lecture questions, for these lab questions. Make sure these data files are saved in the data subdirectory of your course RProject.

Here is the metadata, which describes the data and decodes the names of the columns:

Use here() and read.csv() to read bird.sta.csv into a data.frame called dat_bird and hab.sta.csv into a data.frame called dat_habitat.

Hint: I saved the two data files to a subdirectory called data in my RProject. I used this syntax to create dat_bird:

require(here)
dat_bird = read.csv(
  here("data", "bird.sta.csv")
)
head(dat_bird)
##   basin sub sta AMCR AMDI AMGO AMRO BCCH BEWR BGWA BHCO BHGR BRCR BTPI BUSH CBCH CEWA CORA COYE DEJU DOWO EUST
## 1     D  AL   1    0    0    0    1    0    0    0    0    2    0    0    0    2    0    0    0    0    0    0
## 2     D  AL   2    0    0    0    0    0    0    0    0    2    0    0    0    2    0    0    0    0    0    0
## 3     D  AL   3    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     D  AL   4    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0
## 5     D  AL   5    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0
## 6     D  AL   6    0    0    0    0    0    0    0    0    1    1    0    0    0    0    0    0    0    0    0
##   EVGR GCKI GCSP GRJA HAFL HAWO HETH HEWA HOME HOWR HUVI MALL MAMU MGWA MOQU NOFL OCWA OSFL PISI PIWO PSFL PUFI
## 1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0
## 2    0    0    0    0    0    2    0    0    0    0    0    0    0    1    0    0    0    1    0    0    1    0
## 3    0    0    0    0    0    1    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0
## 4    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
## 5    0    1    0    0    0    0    0    0    0    0    1    0    0    1    0    0    0    0    0    0    2    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
##   PYOW RBNU RBSA RCKI RECR RSTO RTHA RUGR RUHU SCOW SOSP SOVI SPOW SSHA STJA SWTH TOSO TOWA TRSW UNBI UNFL UNTH
## 1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    2    1    0    0    0    0    0    0
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    1    0    0    0    0    0    0    0    0    1    0    1    0    0    0    0    0
## 4    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0
## 5    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    2    0    0    0    1    0    0
## 6    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    1    0    0    0    0    0    0
##   UNWA UNWO VATH VGSW WAVI WCSP WEBL WETA WIFL WIWA WIWR WODU WREN WWPE YRWA b.total b.rich     b.sidi
## 1    0    0    0    0    1    0    0    2    0    0    1    0    0    0    0      13      9 0.12426036
## 2    0    0    0    0    0    0    0    0    0    1    1    0    0    0    0      11      8 0.14049587
## 3    0    0    0    0    0    2    0    0    0    0    0    0    0    0    0       8      7 0.15625000
## 4    0    0    1    0    0    0    0    0    0    2    4    0    0    0    0      10      6 0.24000000
## 5    0    0    0    0    0    0    0    0    0    2    1    0    1    0    0      14     10 0.09693878
## 6    0    0    0    0    2    0    0    0    0    2    1    0    0    0    0      10      8 0.14000000

A few things to notice:

  • I spread the code across multiple lines to make the syntax clearer.
  • Note the use of here() to locate a file within a directory of my main RProject directory.
  • Notice that the subdirectory data and the filename bird.sta.csv are both in quotation marks since I want R to interpret them as literal text and not expressions to be evaluated.
  • Note the use of head() to preview the first 6 lines of the data. Also note that there are so many columns that it is a little hard to read!

Merge the data

The two data files contain information about sampling locations and we’d like to be able to analyze both the bird count data and the habitat data.

We can combine both data sets into a new data frame, but we need to be careful:

  • How do we know that there are the same number of rows in both data frames?
  • How can we be sure that we associate the correct row of dat_habitat and dat_birds?

The merge() function

The two data sets have several columns in common (which ones? that will allow us to combine them and ensure that each row in dat_birds is associated with the correct row in dat_habitat.

Build a new data.frame called dat_all using the merge() function with dat_birds and dat_habitat.

You can test that your merge operation was successful by recreating this scatterplot with the code given below:

plot(ba.tot ~ elev, data = dat_all)

Convert bird data to presence/absence.

You may have noticed that there a lot of zeroes in the bird census data.

For sample, here are counts of Cedar Waxwings at 100 randomly sampled sites:

sample(dat_all$CEWA, 100)
##   [1] 0 0 0 1 0 0 3 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [54] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0

Note: I had to run that code multiple times on my computer before I saw any non-zero entries! You should try running it on your machine several times on your machine to get an idea of how few sampling sites had cedar waxwings.

In fact, a total of only 116 cedar waxwings were observed at 53 (out of 1046) of sites.

Challenge: How could I use R to calculate the total number of waxwings and the number of sites in which they were present?

  • These are slightly different questions, each requiring slightly different code.

Presence/Absence

Since there are so many zeroes, it may be useful to think about bird presence/absence instead of counts.

You’ve created Boolean vectors using the equality test operator == in previous assignments:

my_vec = rep(1:3, 5)
my_vec == 3
##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE

You could also use the greater than test to select the elements that are greater than 1:

my_vec > 1
##  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Use this as a template to write R code to build Boolean vector from the column of count data for cedar waxwings.

Your Boolean vector should have TRUE values for each sampling location where at least one bird was observed, and FALSE for each sampling location where the bird was absent.

Data type coercion

Here’s a cool trick to coerce a Boolean vector into a vector of ones and zeroes:

as.numeric(my_vec > 1)
##  [1] 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1

Voilà!

Use the same strategy to convert your Boolean presence/absence vector for cedar waxwings into a numeric vector. Let’s call it cewa_present_absent.

To test whether you have successfully built cewa_present_absent, run the following code to re-create the plot below:

plot(x = dat_all$elev, y = cewa_present_absent)

How would you interpret the plot? Is elevation a good predictor of whether cedar waxwings are present?

NOTE: our visual intuition is a powerful tool, but it’s not foolproof. We can miss patterns, and see patterns that aren’t really there! Think about how much an electrical outlet looks like a face.

Fitting a logistic curve.

The shape of the curve:

Review the description of the logistic function in McGarigal chapter 4. Pay special attention to the figures on pages 7 and 17 that shows how the parameters a and b relate to the shape and position of the curve.

Also note that the use of the parameters on the two figures is not consistent.

Many commonly used functions in ecology and other fields can be parameterized in two or more ways.

The choice which parameterization to use may depend on things such as:

  • The convention in your field
  • Some parameterizations have different ecological meanings. For example the exponential distribution can be parameterized in two ways:
  • \(\mu\) parameterization: \(\mu\) is the mean number of events.
  • \(\lambda\) parameterization: \(\lambda\) is the mean wait time until an event occurs.

Key things to notice about the logistic curve:

  • You can slide the logistic curve to the right or left. Think this as moving its midpoint. The midpoint is the x-value at which y is equal to 0.5.
  • You can make the curve steeper or shallower near its midpoint. Think of this as the slope.
  • This might remind you of the local linearity concept from calculus and the lecture part of the course.

Visually estimating the parameters of the curve

Recall the functions for visually parameterizing a linear function from the in-class coding activity?

Here are a set of functions that you can use to do the same with a logistic curve:

You can find the equations and figures I used to build these parameterization functions in the McGarigal Deterministic Functions chapter (page 17).

NOTE: For now, don’t worry about the details of how the code below works, you can think of it as a black box for the moment. We’ll talk much more about building custom functions later on.

  • Just paste the code into your console and run it. After you do that the functions get_logistic_param_a(), get_logistic_param_b, logistic(), and logistic_midpoint_slope() will be available for you to use.
# Function to calculate the logistic parameter a given the slope and midpoint
get_logistic_param_a = function(slope, midpoint)
{
  b = slope / 4
  return (-midpoint * (slope / 4))
}

# Function to calculate the logistic parameter b given the slope
get_logistic_param_b = function(slope)
{
  return (slope / 4)
}


# Calculate the value of the logistic function at x, given the parameters a and b.
logistic = function(x, a, b)
{
  val = exp(a + b * x)
  return(val / (1 + val))
}

# Calculate the value of the logistic function at x, given a slope and midpoint.
logistic_midpoint_slope = function(x, midpoint, slope)
{
  b = get_logistic_param_b(slope)
  a = get_logistic_param_a(slope, midpoint)
  return(logistic(x, a, b))
}

Logistic Fit 1

Let’s try to plot a logistic curve. Let’s use a midpoint of 400. We can try a slope of 0.1 to start with.

plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = 0.1), add = TRUE)

Logistic Fit 2

Let’s try a negative slope:

plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = -0.1), add = TRUE)

Logistic Fit 3

Let’s try a shallower negative slope:

plot(x = dat_all$elev, y = cewa_present_absent)
curve(logistic_midpoint_slope(x, midpoint = 400, slope = -0.005), add = TRUE)

Does elevation look like a good predictor of cedar waxwing presence/absence?

Does the logistic function look like a good model for cedar waxwing presence/absence?

Perhaps a different variable would better explain the bird presence/absence.

Plot Thoughts

Some reflections on the plots:

  • Look at those ugly default x- and y- axis labels!
  • These figures would also be a lot better with appropriate titles!
  • Lots of the points overlap, making it hard to distinguish individual points. What are some ways you might improve the data representation?

A couple of arguments to plot() and curve() might be of interest:

  • cex: this controls the size of the plotting character. Default is 1.
  • pch: this controls the shape of the plotting character. Default is 1. Code 16 plots a solid point.
  • col: this controls the color of the plotting character. Default is 1, corresponding to black.

A cool function that I like to use is adjustcolor(). It allows you to make points semi-transparent. For example, I used the col argument and adjustcolor() inside a call to plot() to create the some modifications to one of my plots above:

Instructions and Deliverables

  1. Use the pair plot function from psych to create a pair plot of the three terrain variables (slope, aspect, elevation) and basal area.
  2. Choose two bird species and create plots of presence/absence (on the y-axis) and basal area (on the x axes).
  • Consult the metadata file for the key to bird names and abbreviations.
  • Visually inspect the plots and fit logistic curves using the parameterization functions provided above.
  • If total basal area does not seem like a good predictor, you should try:
    • The basal area of either conifers, hardwood trees, or snags.
    • The terrain variables.
  1. Answer the assignment questions and submit your final answers via Moodle.

Lab Questions

Basal Area, Questions 1 - 2.

  • Q5 (1 pt.): What is basal area, and how is it measured?
  • Q2 (2 pts.): Include a figure of your terrain/basal area pairplot.

Bird Species 1: Questions 3 - 4

Consider the first bird species you chose to examine in the walkthrough.

  • Q3 (1 pt.): Include a figure of your logistic function plot. Your figure must include the name of the bird species, appropriate title, axes, etc.

  • Q4 (3 pts.): Qualitatively describe the bird’s presence/absence patterns in terms of basal area (or your other chosen predictor). Your answer should make reference to your fitted logistic model plot. Some questions you might consider are:

    • Does the bird species seem to prefer areas with high or low tree cover?
    • Does the bird species prefer low or high elevations? (for example, if you used elevation instead of basal area)
    • Does a logistic model seem like a good fit?

Bird Species 2: Questions 5 - 6

Consider the second bird species you chose to examine in the walkthrough.

  • Q5 (1 pt.): Include a figure of your logistic function plot. Your figure must include the name of the bird species, appropriate title, axes, etc.

  • Q6 (3 pts.): Qualitatively describe the bird’s presence/absence patterns in terms of basal area (or your other chosen predictor). Your answer should make reference to your fitted logistic model plot. Some questions you might consider are:

    • Does the bird species seem to prefer areas with high or low tree cover?
    • Does the bird species prefer low or high elevations? (for example, if you used elevation instead of basal area)
    • Does a logistic model seem like a good fit?

Gray Jays: Question 7 - 10

  • Q7 (1 pt.): How many total number of Gray Jays were observed in all of the sampling sites.
  • Q8 (2 pts.): Show the R code you used to perform the calculation.

  • Q9 (1 pt.): Calculate the total number of sampling sites in which Gray Jays were observed.

Hint: What happens when you use the sum() function on a vector of Boolean values?

  • Q10 (2 pts.): Include the R code you used to perform the presence/absence calculation.

NOTE: To grade your R code, we will attempt to run it with only the dat_all data frame in the R environment. You should test that your code does not depend upon any other variables.

Report

Compile your answers to all 0 questions into a pdf document and submit via Moodle.

  • You may also do your work in an R Notebook and submit a rendered html file.