Analysis of Environmental Data at the University of Massachusetts, Amherst
This page contains information and supporting documents for the Analysis of Environmental Data lecture and lab courses at the University of Massachusetts Amherst.
Analysis of Environmental Data (ECo 602/634) is a core course for master’s and Ph.D. students in the Department of Environmental Conservation at UMass Amherst.
This course provides students with an understanding of basic statistical concepts critical to the proper use of statistics in ecology and conservation science, and prepares students for subsequent ECo courses in ecological modeling. The lecture (required for all ECo master’s-level graduate students) covers foundational concepts in statistical modeling, with a focus on defining statistical models and the major inference paradigms in use today; basic study design concepts, emphasizing the practical issues that arise in real-world ecological study designs and statistical modeling; and the ‘landscape’ of statistical methods available for ecological modeling. Throughout, the emphasis is on the conceptual underpinnings of statistics rather than on methodology.
This laboratory course introduces the statistical computing language R and provides hands-on experience using R to screen and adjust data, examine deterministic functions and probability distributions, conduct classic one- and two-sample tests, utilize bootstrapping and Monte Carlo randomization procedures, and conduct stochastic simulations for ecological modeling.
Specifically, lab focuses on learning the R language and statistical computing environment, which serves as the computing platform for all ECO statistics courses; emphasis is on learning fundamental R skills that will allow students to grow and expand their expertise in subsequent courses or on their own.
The course draws readings from diverse texts and journal articles including:
We also utilize Kevin McGarigal’s materials for previous versions of this course. They are a nice synthesis of content from Bolker, Zuur, and other sources written in a very accessible style.
Computer programs and other resources used in the course include:
This class is supported by DataCamp, an intuitive learning platform for data science. Learn any time, anywhere, and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 325+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 5 million learners around the world and close your skills gap.
You should complete the readings for the week before the first lecture on Tuesday. Completing the readings before class will enable you to more fully engage with the in-class activities and discussions.
Reading question sets will be due the following Sunday by 11:59 PM. For example, the reading questions for week 2 are due on Sunday, Sep. 18th at 11:59 PM.
This week’s readings are:
There are no week 1 reading questions
Topic Highlights
Readings and Questions
This week’s readings are:
Tuesday
Thursday
Lab
This question draws mostly upon materials from the Bolker reading: Chapter 1, sections 1.1 - 1.3, and the ideas in the Model Thinking lecture and notes.
Choose one of the modeling dichotomies that Bolker writes about in sections 1.1 - 1.3 (summarized in table 1.1 on page 6).
This question draws mostly upon materials from the McGarigal chapter 1 slides.
In common language, the terms assumption and bias usually describe negative, value-laden concepts that we should avoid.
Our usage of these terms is slightly different. In this course we’re more interested in identifying important (and sometimes hidden) assumptions and biases. Sometimes we gain important insights simply by identifying assumptions and biases in our modeling process.
Just like uncertainty, we can improve our ability to detect hidden assumptions and biases. We can then be more informed modelers.
McGarigal states in chapter 1:
“Western science and our society requires that challenges to the status quo be empirically and rigorously demonstrated (analogy: ‘innocent until proven guilty’)”
“… This is because we live in a world where challenges to the status quo are given little credence without solid evidence for the alternative (analogy: innocent until proven guilty).”
“Whether we are presenting our findings to a scientific audience (e.g., in a scientific journal) or to managers, policy-makers, or the general public, we are increasingly asked to defend our conclusions on the basis of statistical evidence.”
Part of being a good modeler is identifying biases: implicit, cultural, scientific, etc.
Often these biases come from implicit assumptions that we don’t even know we’re making!
Consider some potential assumptions and/or biases in the above quotes, and in the description of the four testimonials regarding climate change and bird nesting habitat.
This question draws mostly upon materials from the McGarigal Chapter 1 and Bolker Chapter 1 readings.
This question draws mostly upon materials from the McGarigal chapter 2 slides.
This question is related to the McGarigal chapter 2 slides and the in-class group model thinking activity.
Consider the scenario your group chose to use in the model thinking in-class activity:
Choose 2 of the following data types and scales.
For each of your chosen variable type/scale types:
For example, if I were studying herbivory of Monarch caterpillars on different species of milkweeds (Asclepias spp.), I might measure the host plant species on a categorical/nominal scale.
I could measure amount of herbivory on a ratio scale because a value of zero herbivory is meaningful.
Remember, you only need to choose 2 of the 4 variable types/scales.
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Make sure to read the in-class assignment directions before you come to class!
Lab
Readings and Questions
This week’s readings are:
Consider the following types of plots described in the McGarigal and Zuur readings:
Consider the following types of plots described in the McGarigal and Zuur readings:
Conditional plot, conditioning variable, and related terms occurred throughout the Zuur and McGarigal readings.
Consider a dataset that you have collected or worked with.
If you haven’t worked much with existing datasets hypothesize a dataset that you might collect for your research.
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Lab - Start Lab 4 - Continue Lab 3
Readings and Questions
This week’s readings are:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented a simulated example of density-dependent predator-prey interactions in which he fit several different models to the data.
Consider only the Ricker and quadratic models.
Some concepts to keep in mind:
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Readings
Tuesday
Thursday
Lab
Q1 (2 pts.): Choose the best words or phrases to fill in the blanks: A probability distribution is a map from the (a)_____ to the (b)_____.
Q2 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two coins sequentially: a penny and a quarter? Assume that
Q3 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two quarters at the same time? Assume that
Q4 (2 pts.): How many outcomes are there if you flip a penny three times? If you care about the order of flips, how many possible events are there in the sample space?
Q5 (1 pt.): Are the events in the previous question combinations, or permutations?
Q6 (2 pts.): Now suppose you don’t care about the order, and you simply want to know about the number of heads when you flip the penny three times. How many possible events are in the sample space?
Q7 (1 pt.): Are the events in the previous question combinations, or permutations?
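If you want to check your intuition about these questions, the counting can be done directly in R. The sketch below (not part of the assignment) enumerates the ordered outcomes of three coin flips, then collapses them into events defined only by the number of heads:

```r
# Enumerate the ordered outcomes (permutations) of three coin flips.
flips = expand.grid(flip1 = c("H", "T"),
                    flip2 = c("H", "T"),
                    flip3 = c("H", "T"))
nrow(flips)  # 8 ordered outcomes

# If only the number of heads matters, collapse to counts:
n_heads = rowSums(flips == "H")
table(n_heads)  # 4 possible events: 0, 1, 2, or 3 heads
```

`expand.grid()` builds every combination of its arguments, which is exactly an enumeration of the sample space when order matters.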
A sample space is the ….
Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve just started to drop their acorns.
Without looking, you reach down and pick up two acorns in one hand at the same time and shuffle them around before you look.
Describe the sample space of your collection (i.e. enumerate the set of all possible outcomes).
Some things to consider when describing your sample space:
A sample space is the ….
Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve dropped most of their acorns. It was a productive year so there seem to be thousands of acorns from each species!
You collect an acorn, place it in your left pocket, walk a short distance and collect a second acorn placing it in your right pocket.
Some things to consider when describing your sample space:
For the questions below consider two discrete probability distributions, parameterized as:
Q17 (1 pt.): Which of the following is the size of the sample space of this Poisson distribution?
Q18 (2 pts.): Which of the following is the size of the sample space of this Binomial distribution?
Q19 (2 pts.): Describe a characteristic that is common to both the Binomial and Poisson distributions that makes them good models for counts.
Q20 (2 pts.): Hypothesize a scenario in which a Binomial distribution may be a better count model than a Poisson distribution.
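One way to see the difference between the two sample spaces is to evaluate the probability mass functions in R. The parameter values below are arbitrary choices for illustration:

```r
# The Poisson sample space is unbounded: any count 0, 1, 2, ... has
# positive probability. The Binomial sample space is bounded by the
# number of trials, n.
dpois(0:5, lambda = 2)             # every count has positive probability
dbinom(0:5, size = 4, prob = 0.5)  # zero probability for counts above n = 4

# The Binomial's full sample space {0, ..., n} sums to exactly 1:
sum(dbinom(0:4, size = 4, prob = 0.5))
```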
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Readings
Tuesday
Thursday
Lab
The Bolker reading was difficult….
Bolker used a seed predation experiment to illustrate the statistical frameworks.
The primary question in his examples is: Do seed predation rates vary among species?
Reminder: in the Frequentist paradigm, a null hypothesis can be used as a baseline against which you can compare your observations.
I have found that recreating the calculations in a difficult reading helps me understand and follow it.
In that spirit, you’ll use data presented in Bolker Table 1.2 (on the top of page 11) to calculate the seed predation rates.
Here’s some template R code you can use to get started:
# Clear your R environment to make
# sure there are no stray variables.
rm(list = ls())
pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = ????
pol_predation_rate = ????
psd_n_predation = ????
psd_n_no_predation = ????
psd_n_total = ????
psd_predation_rate = ????
Self-test: Run the following code after you have made your calculations. Your rates should match the observed proportions in the Bolker text on page 11.
print(
paste0(
"The seed predation rate for Polyscias fulva is: ",
round(pol_predation_rate, digits = 3)))
print(
paste0(
"The seed predation rate for Pseudospondias microcarpa is: ",
round(psd_predation_rate, digits = 3)))
pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = 210
pol_predation_rate = pol_n_predation/pol_n_total
psd_n_predation = 25
psd_n_no_predation = 706
psd_n_total = 731
psd_predation_rate = psd_n_predation/psd_n_total
predation_ratio = pol_predation_rate/psd_predation_rate
round(predation_ratio, digits = 3)
## [1] 3.62
Hint: To make sure that your code is written correctly, you need to run it in an empty R environment.
The call rm(list = ls()) on the first line of the code template removes all variables from the environment. Make sure you include it in your code!
Create a table and fill in the missing values:
| Species | Any taken | None taken | N | Predation rate |
|---|---|---|---|---|
| Polyscias fulva (pol) | 26 | 184 | __ | __ |
| Pseudospondias microcarpa (psd) | __ | __ | __ | __ |
Use the seed predation proportions you calculated to determine the ratio of seed predation proportions.
Things to consider:
Save your answers in a pdf document (or a knitted html document) and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
A few things to remember about Frequentist Confidence Intervals:
The width of confidence intervals is influenced by properties of both the population and the sampling process.
Recall that we are not 95% certain that a 95% confidence interval we calculate contains the true value.
For Questions 1 - 4, assume you are working with a population that is normally-distributed with mean \(\mu\) and standard deviation \(\sigma\). Note that although these population parameters exist, you cannot know their exact values and you must estimate them through sampling.
Your explanation will be more successful if you use an example or describe your answer in the context of a real-life scenario rather than a purely theoretical explanation.
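The long-run interpretation of a confidence interval can be checked by simulation. The sketch below draws many samples from a Normal population with known (arbitrarily chosen) parameters and records how often each sample’s 95% CI contains the true mean:

```r
# Draw many samples from a known Normal population and record how
# often each sample's 95% CI contains the true mean.
set.seed(42)
true_mean = 10
covered = replicate(10000, {
  x = rnorm(25, mean = true_mean, sd = 3)
  ci = t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)  # approximately 0.95: the long-run coverage rate
```

Note that each individual interval either contains the true mean or it doesn’t; the 95% refers to the long-run proportion of intervals that do.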
Save your answers in a pdf document and upload to moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
Refer back to sections 7.1 and 8.2 for McGarigal’s descriptions of the form of the linear statistical model for the non-parametric OLS and parametric likelihood-based inference techniques.
Note: McGarigal specifies the parametric model using this notation:
\(Y \sim Normal(a + bx, \sigma)\)
However, both the parametric and non-parametric models can be expressed in the more familiar regression model format:
\(y_i = \beta_0 + \beta_1 x_i + e_i\)
Interpolation and extrapolation may both be used to make predictions.
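A sketch (with simulated data, so the true parameter values are known) of fitting the regression model above and then predicting inside versus outside the observed range of x:

```r
# Fit y_i = beta_0 + beta_1 * x_i + e_i to simulated data.
set.seed(1)
x = runif(30, min = 0, max = 10)
y = 2 + 0.5 * x + rnorm(30, sd = 1)
fit = lm(y ~ x)
coef(fit)  # estimates of beta_0 (intercept) and beta_1 (slope)

# Interpolation: predicting within the observed range of x.
predict(fit, newdata = data.frame(x = 5))

# Extrapolation: predicting far outside the observed range.
# The model will happily return a number, but it is far less reliable.
predict(fit, newdata = data.frame(x = 50))
```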
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
“In the best case, your data will match a classical technique like linear regression exactly, and the answers provided by classical statistical models will agree with the results from your likelihood model.” - Bolker (2008)
Bolker describes custom-made analyses based on Maximum Likelihood, which often have a biological, ecological, or mechanistic justification.
He contrasts these with the familiar, “canned” Least Squares methods that we typically learn in our first statistics course.
“The normality assumption means that if we repeat the sampling many times under the same environmental conditions, the observations will be normally distributed for each value of X.” - Zuur (2007)
A common misconception about this assumption is that the values of the response variable must be normally distributed.
Consider this histogram of penguin bill lengths:
Bill lengths appear very non-normal.
The very low p-value in the Shapiro test of normality provides strong evidence against the null hypothesis that bill lengths are normally distributed.
shapiro.test(penguins$bill_length_mm)
Shapiro-Wilk normality test
data: penguins$bill_length_mm
W = 0.97485, p-value = 1.12e-05
Nevertheless, a general linear model of bill lengths that includes species and body mass as predictors passes a test of the normality assumption:
fit_1 = lm(bill_length_mm ~ body_mass_g + species, data = penguins)
shapiro.test(residuals(fit_1))
Shapiro-Wilk normality test
data: residuals(fit_1)
W = 0.99317, p-value = 0.123
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
“The first part of the AIC definition is a measure of goodness of fit. The second part is a penalty for the number of parameters in the model.” - Zuur (2007)
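The two parts of the definition can be verified directly in R. The sketch below uses a built-in dataset (`mtcars`, chosen only for illustration) to recompute AIC by hand:

```r
# AIC = -2 * log-likelihood (goodness of fit) + 2 * k (penalty),
# where k counts estimated parameters. For a Gaussian lm, k includes
# the residual standard deviation sigma.
fit = lm(mpg ~ wt, data = mtcars)
k = length(coef(fit)) + 1  # intercept, slope, and sigma
aic_by_hand = -2 * as.numeric(logLik(fit)) + 2 * k
all.equal(aic_by_hand, AIC(fit))  # TRUE
```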
Consider the regression equation for a simple linear regression:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
Your answer must be in plain non-technical language. Your explanation will be most effective if you use a narrative approach, using a concrete example to illustrate the concept.
Consider an experiment looking at plant biomass response to water treatments.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.4 | 2.19 | 98.371 | 0.001 |
| waterMed | 1.3 | 5.12 | 0.480 | 0.231 |
| waterHigh | 13.6 | 3.48 | 24.495 | 0.001 |
Q3 (1 pt.): Based on the model table, what is the base case water treatment?
Q4 (2 pts.): What is the average plant mass, in grams, for the low water treatment?
Q5 (2 pts.): What is the average plant mass, in grams, for the medium water treatment?
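If the logic of base cases and dummy variables is unfamiliar, the sketch below shows how R’s default treatment contrasts produce a table like the one above. The factor levels and group means here are made up for illustration; they are not the values in the question’s table:

```r
# With a factor predictor, R's default treatment contrasts pick a
# base case (the first factor level). The intercept estimates the
# base-case mean; the other coefficients estimate differences from it.
set.seed(7)
water = factor(rep(c("Low", "Med", "High"), each = 10),
               levels = c("Low", "Med", "High"))
group_means = c(Low = 2, Med = 4, High = 15)
mass = rnorm(30, mean = group_means[as.character(water)], sd = 1)
fit = lm(mass ~ water)
coef(fit)
# (Intercept) estimates the mean of the base case ("Low");
# waterMed and waterHigh estimate differences from that base case.
```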
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
No Reading Questions
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
McGarigal writes:
We expect a model with more parameters to fit better in the sense that the negative log-likelihood should be smaller if we add more terms to the model. But we also expect that adding more parameters to a model leads to increasing difficulty of interpretation.
Consider the trade-off between model complexity and interpretability.
Since your answer is targeted to a non-scientist audience, you should use narrative style using a concrete example.
Consider this table of model coefficients from a plant growth experiment with three continuous predictor variables: water, nitrogen, and phosphorus.
Note: The amount of water, N, and P given to each plant was randomized at the beginning of the experiment.
The response variable was plant biomass accumulation (in grams).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.7 | 0.23 | 98.371 | 0.061 |
| water | 0.043 | 0.001 | 0.480 | 0.021 |
| nitro | 0.192 | 0.034 | 1.495 | 0.007 |
| phosph | -0.027 | 0.014 | 0.091 | 0.721 |
Q3 (2 pts.): Using the information in the model coefficient table above, calculate the expected biomass for a plant given:
0 mL water per week
0 mg nitrogen per week
0 mg phosphorus per week
Explain how you made the calculation.
Q4 (2 pts.): Using the information in the model coefficient table above, what is the expected biomass for a plant given:
10 mL water per week
30 mg nitrogen per week
20 mg phosphorus per week
Explain how you made the calculation.
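The general recipe for this kind of calculation is: intercept plus the sum of each coefficient times its predictor value. The sketch below uses made-up coefficients, not the values from the table above:

```r
# Expected value from a coefficient table: intercept plus the sum of
# each coefficient times its predictor value. Coefficients here are
# hypothetical, for illustration only.
coefs = c(intercept = 1.0, water = 0.5, nitro = 0.2, phosph = -0.1)
vals  = c(water = 10, nitro = 30, phosph = 20)
expected_biomass = unname(coefs["intercept"] + sum(coefs[names(vals)] * vals))
expected_biomass  # 1.0 + 0.5*10 + 0.2*30 + (-0.1)*20 = 10
```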
Consider the data types/scales of the predictor and response variables.
We often present the equation for a simple linear regression model as:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
We often present the equation for a simple linear regression model as:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
Please note that these lecture decks may be updated to correct spelling or other errors, fix formatting, include additional content, etc.
I will endeavor to make the slide decks available at least one week prior to the lecture session in which they will be used.
Deck 3: Data Exploration, Functions, and Associations
Deck 4: Distributions: Notation, Functions, and Probability
Deck 5: Frequentist Hypotheses and Confidence
Deck 6: Frameworks: Least Squares, Likelihood, Frequentist, Bayesian
Deck 8: Beyond The General Linear Model
Deck 9: Interactions, Dummy Variables, and Model Interpretation
Deck 10: Conditional Probability and Intro to Bayesian Perspective
- Use `\n` to insert a line break into a title or axis label.
- Use the `col` argument to specify a color for the bars.
- Use the `border` argument to specify the color of the outlines of the bars.
- Use the `adjustcolor()` function to make a color lighter by specifying an alpha value of less than 1.

# In-Class Fancy Histogram Demo
require(palmerpenguins)
hist(
penguins$bill_length_mm,
main = "Hist 'o Gram of Bill Length\nBy Mike Nelson",
col =
adjustcolor(col = "steelblue", alpha.f = .4),
border = "red",
xlab = "Bill Length (in mm)")
If a histogram has some very short bins at high values of x, you can truncate the display using the `xlim` argument. For example, if you had some data stored in a data frame and you wanted to make a histogram of the column called `wingspan`:
This doesn’t look great:
hist(
dat$wingspan,
main = "Histogram of Wingspan",
xlab = "wingspan (cm)")
I can truncate the x-values to be between 0 and 10 using `xlim`. I’ll also make more bins using the `breaks` argument.
R considers the `breaks` argument to be a suggestion… Telling it to create 30 bins won’t always result in 30 bins. You can experiment with different numbers of bins until you find a number that works for your plot.

hist(
dat$wingspan,
main = "Histogram of Wingspan",
xlab = "wingspan (cm)",
xlim = c(0, 10),
breaks = 30)
Click the links for details about each assignment.
Click the links for details about each assignment.
Click the assignment name for walkthrough and questions
You can find the final project instructions and problems here.
The final problem set consists of two parts:
Here are some supplemental links to helpful resources for various topics covered in class. I’ll continually update this list. If you find a resource that was helpful for you, let me know and I’ll add it here!
We only briefly mention the ggplot paradigm in the course, but this Complete ggplot2 Tutorial may be useful for those who want to learn more.