Analysis of Environmental Data at the University of Massachusetts, Amherst
This page contains information and supporting documents for the Analysis of Environmental Data lecture and lab courses at the University of Massachusetts Amherst.
Analysis of Environmental Data (ECo 602/634) is a core course for master’s and Ph.D. students in the Department of Environmental Conservation at UMass Amherst.
This course provides students with an understanding of basic statistical concepts critical to the proper use of statistics in ecology and conservation science, and prepares students for subsequent ECo courses in ecological modeling. The lecture (required for all ECo master’s-level graduate students) covers foundational concepts in statistical modeling, with a focus on defining statistical models and the major inference paradigms in use today; basic study design concepts, emphasizing the practical issues that arise in real-world ecological study designs and statistical modeling; and the ‘landscape’ of statistical methods available for ecological modeling. Throughout, the emphasis is on the conceptual underpinnings of statistics rather than on methodology.
This laboratory course introduces the statistical computing language R and provides hands-on experience using R to screen and adjust data, examine deterministic functions and probability distributions, conduct classic one- and two-sample tests, utilize bootstrapping and Monte Carlo randomization procedures, and conduct stochastic simulations for ecological modeling.
Specifically, lab focuses on learning the R language and statistical computing environment, which serves as the computing platform for all ECO statistics courses; emphasis is on learning fundamental R skills that will allow students to grow and expand their expertise in subsequent courses or on their own.
The course draws readings from diverse texts and journal articles including:
We also utilize Kevin McGarigal’s materials for previous versions of this course. They are a nice synthesis of content from Bolker, Zuur, and other sources written in a very accessible style.
Computer programs and other resources used in the course include:
This class is supported by DataCamp, an intuitive learning platform for data science. Learn any time, anywhere, and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 325+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 5 million learners around the world and close your skills gap.
You should complete the readings for the week before the first lecture on Tuesday. Completing the readings before class will enable you to more fully engage with the in-class activities and discussions.
Reading question sets will be due the following Sunday by 11:59 PM. For example, the reading questions for week 2 are due on Sunday, Sep. 18th at 11:59 PM.
This week’s readings are:
There are no week 1 reading questions
Topic Highlights
Readings and Questions
This week’s readings are:
Tuesday
Thursday
Lab
This question draws mostly upon materials from the Bolker reading: Chapter 1, sections 1.1 - 1.3, and the ideas in the Model Thinking lecture and notes.
Choose one of the modeling dichotomies that Bolker writes about in sections 1.1 - 1.3 (summarized in table 1.1 on page 6).
This question draws mostly upon materials from the McGarigal chapter 1 slides.
In common language, the terms assumption and bias usually describe negative, value-laden concepts that we should avoid.
Our usage of these terms is slightly different. In this course we’re more interested in identifying important (and sometimes hidden) assumptions and biases. Sometimes we gain important insights simply by identifying assumptions and biases in our modeling process.
Just like uncertainty, we can improve our ability to detect hidden assumptions and biases. We can then be more informed modelers.
McGarigal states in chapter 1:
“Western science and our society requires that challenges to the status quo be empirically and rigorously demonstrated (analogy: ‘innocent until proven guilty’)”
“… This is because we live in a world where challenges to the status quo are given little credence without solid evidence for the alternative (analogy: innocent until proven guilty).”
“Whether we are presenting our findings to a scientific audience (e.g., in a scientific journal) or to managers, policy-makers, or the general public, we are increasingly asked to defend our conclusions on the basis of statistical evidence.”
Part of being a good modeler is identifying biases: implicit, cultural, scientific, etc.
Often these biases come from implicit assumptions that we don’t even know we’re making!
Consider some potential assumptions and/or biases in the above quotes, and in the description of the four testimonials regarding climate change and bird nesting habitat.
This question draws mostly upon materials from the McGarigal Chapter 1 and Bolker Chapter 1 readings.
This question draws mostly upon materials from the McGarigal chapter 2 slides.
This question is related to the McGarigal chapter 2 slides and the in-class group model thinking activity.
Consider the scenario your group chose to use in the model thinking in-class activity:
Choose 2 of the following data types and scales.
For each of your chosen variable type/scale types:
For example, if I were studying herbivory of Monarch caterpillars on different species of milkweeds (Asclepias spp.), I might measure the host plant species on a categorical/nominal scale.
I could measure amount of herbivory on a ratio scale because a value of zero herbivory is meaningful.
Remember, you only need to choose 2 of the 4 variable types/scales.
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Make sure to read the in-class assignment directions before you come to class!
Lab
Readings and Questions
This week’s readings are:
Consider the following types of plots described in the McGarigal and Zuur readings:
Consider the following types of plots described in the McGarigal and Zuur readings:
Conditional plot, conditioning variable, and related terms occurred throughout the Zuur and McGarigal readings.
Consider a dataset that you have collected or worked with.
If you haven’t worked much with existing datasets hypothesize a dataset that you might collect for your research.
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Lab - Start Lab 4 - Continue Lab 3
Readings and Questions
This week’s readings are:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented two studies of Brown creepers:
For each model:
McGarigal presented a simulated example of density-dependent predator-prey interactions in which he fit several different models to the data.
Consider only the Ricker and quadratic models.
Some concepts to keep in mind:
Save your answers in a pdf document and upload to Moodle. Make sure you include:
Topic Highlights
Readings
Tuesday
Thursday
Lab
Q1 (2 pts.): Choose the best words or phrases to fill in the blanks: A probability distribution is a map from the (a)_____ to the (b)_____.
Q2 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two coins sequentially: a penny and a quarter? Assume that
Q3 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two quarters at the same time? Assume that
Q4 (2 pts.): How many outcomes are there if you flip a penny three times? If you care about the order of flips, how many possible events are there in the sample space?
Q5 (1 pt.): Are the events in the previous question combinations, or permutations?
Q6 (2 pts.): Now suppose you don’t care about the order, and you simply want to know about the number of heads when you flip the penny three times. How many possible events are in the sample space?
Q7 (1 pt.): Are the events in the previous question combinations, or permutations?
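If you want to check your intuition about these questions, the counting can be done directly in R. The sketch below (not part of the assignment) enumerates the ordered outcomes of three coin flips, then collapses them into events defined only by the number of heads:

```r
# Enumerate the ordered outcomes (permutations) of three coin flips.
flips = expand.grid(flip1 = c("H", "T"),
                    flip2 = c("H", "T"),
                    flip3 = c("H", "T"))
nrow(flips)  # 8 ordered outcomes

# If only the number of heads matters, collapse to counts:
n_heads = rowSums(flips == "H")
table(n_heads)  # 4 possible events: 0, 1, 2, or 3 heads
```

`expand.grid()` builds every combination of its arguments, which is exactly an enumeration of the sample space when order matters.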
A sample space is the ….
Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve just started to drop their acorns.
Without looking, you reach down and pick up two acorns in one hand at the same time and shuffle them around before you look.
Describe the sample space of your collection (i.e. enumerate the set of all possible outcomes).
Some things to consider when describing your sample space:
A sample space is the ….
Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve dropped most of their acorns. It was a productive year so there seem to be thousands of acorns from each species!
You collect an acorn, place it in your left pocket, walk a short distance and collect a second acorn placing it in your right pocket.
Some things to consider when describing your sample space:
For the questions below consider two discrete probability distributions, parameterized as:
Q17 (1 pt.): Which of the following is the size of the sample space of this Poisson distribution?
Q18 (2 pts.): Which of the following is the size of the sample space of this Binomial distribution?
Q19 (2 pts.): Describe a characteristic that is common to both the Binomial and Poisson distributions that makes them good models for counts.
Q20 (2 pts.): Hypothesize a scenario in which a Binomial distribution may be a better count model than a Poisson distribution.
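One way to see the difference between the two sample spaces is to evaluate the probability mass functions in R. The parameter values below are arbitrary choices for illustration:

```r
# The Poisson sample space is unbounded: any count 0, 1, 2, ... has
# positive probability. The Binomial sample space is bounded by the
# number of trials, n.
dpois(0:5, lambda = 2)             # every count has positive probability
dbinom(0:5, size = 4, prob = 0.5)  # zero probability for counts above n = 4

# The Binomial's full sample space {0, ..., n} sums to exactly 1:
sum(dbinom(0:4, size = 4, prob = 0.5))
```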
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Readings
Tuesday
Thursday
Lab
The Bolker reading was difficult….
Bolker used a seed predation experiment to illustrate the statistical frameworks.
The primary question in his examples is: Do seed predation rates vary among species?
Reminder: in the Frequentist paradigm, a null hypothesis can be used as a baseline against which you can compare your observations.
I have found that recreating the calculations in a difficult reading helps me understand and follow it.
In that spirit, you’ll use data presented in Bolker Table 1.2 (on the top of page 11) to calculate the seed predation rates.
Here’s some template R code you can use to get started:
# Clear your R environment to make
# sure there are no stray variables.
rm(list = ls())
pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = ????
pol_predation_rate = ????
psd_n_predation = ????
psd_n_no_predation = ????
psd_n_total = ????
psd_predation_rate = ????
Self-test: Run the following code after you have made your calculations. Your rates should match the observed proportions in the Bolker text on page 11.
print(
paste0(
"The seed predation rate for Polyscias fulva is: ",
round(pol_predation_rate, digits = 3)))
print(
paste0(
"The seed predation rate for Pseudospondias microcarpa is: ",
round(psd_predation_rate, digits = 3)))
pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = 210
pol_predation_rate = pol_n_predation/pol_n_total
psd_n_predation = 25
psd_n_no_predation = 706
psd_n_total = 731
psd_predation_rate = psd_n_predation/psd_n_total
predation_ratio = pol_predation_rate/psd_predation_rate
round(predation_ratio, digits = 3)
## [1] 3.62
Hint: To make sure that your code is written correctly, you need to run it in an empty R environment.
The call rm(list = ls()) on the first line of the code template removes all variables from the environment. Make sure you include it in your code!
Create a table and fill in the missing values:
| Species | Any taken | None taken | N | Predation rate |
|---|---|---|---|---|
| Polyscias fulva (pol) | 26 | 184 | __ | __ |
| Pseudospondias microcarpa (psd) | __ | __ | __ | __ |
Use the seed predation proportions you calculated to determine the ratio of seed predation proportions.
Things to consider:
Save your answers in a pdf document (or a knitted html document) and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
A few things to remember about Frequentist Confidence Intervals:
The width of confidence intervals is influenced by properties of both the population and the sampling process.
Recall that we are not 95% certain that a 95% confidence interval we calculate contains the true value.
For Questions 1 - 4, assume you are working with a population that is normally-distributed with mean \(\mu\) and standard deviation \(\sigma\). Note that although these population parameters exist, you cannot know their exact values and you must estimate them through sampling.
Your explanation will be more successful if you use an example or describe your answer in the context of a real-life scenario rather than a purely theoretical explanation.
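The long-run interpretation of a confidence interval can be checked by simulation. The sketch below draws many samples from a Normal population with known (arbitrarily chosen) parameters and records how often each sample’s 95% CI contains the true mean:

```r
# Draw many samples from a known Normal population and record how
# often each sample's 95% CI contains the true mean.
set.seed(42)
true_mean = 10
covered = replicate(10000, {
  x = rnorm(25, mean = true_mean, sd = 3)
  ci = t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)  # approximately 0.95: the long-run coverage rate
```

Note that each individual interval either contains the true mean or it doesn’t; the 95% refers to the long-run proportion of intervals that do.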
Save your answers in a pdf document and upload to moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
Refer back to sections 7.1 and 8.2 for McGarigal’s descriptions of the form of the linear statistical model for the non-parametric OLS and parametric likelihood-based inference techniques.
Note: McGarigal specifies the parametric model using this notation:
\(Y \sim Normal(a + bx, \sigma)\)
However, both the parametric and non-parametric models can be expressed in the more familiar regression model format:
\(y_i = \beta_0 + \beta_1 x_i + e_i\)
Interpolation and extrapolation may both be used to make predictions.
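A sketch (with simulated data, so the true parameter values are known) of fitting the regression model above and then predicting inside versus outside the observed range of x:

```r
# Fit y_i = beta_0 + beta_1 * x_i + e_i to simulated data.
set.seed(1)
x = runif(30, min = 0, max = 10)
y = 2 + 0.5 * x + rnorm(30, sd = 1)
fit = lm(y ~ x)
coef(fit)  # estimates of beta_0 (intercept) and beta_1 (slope)

# Interpolation: predicting within the observed range of x.
predict(fit, newdata = data.frame(x = 5))

# Extrapolation: predicting far outside the observed range.
# The model will happily return a number, but it is far less reliable.
predict(fit, newdata = data.frame(x = 50))
```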
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
“In the best case, your data will match a classical technique like linear regression exactly, and the answers provided by classical statistical models will agree with the results from your likelihood model.” - Bolker (2008)
Bolker describes custom-made analyses based on Maximum Likelihood, which often have a biological, ecological, or mechanistic justification.
He contrasts these with the familiar, “canned” Least Squares methods that we typically learn in our first statistics course.
“The normality assumption means that if we repeat the sampling many times under the same environmental conditions, the observations will be normally distributed for each value of X.” - Zuur (2007)
A common misconception about this assumption is that the values of the response variable must be normally distributed.
Consider this histogram of penguin bill lengths:
Bill lengths appear very non-normal.
The very low p-value in the Shapiro test of normality provides strong evidence against the null hypothesis that bill lengths are normally distributed.
shapiro.test(penguins$bill_length_mm)
Shapiro-Wilk normality test
data: penguins$bill_length_mm
W = 0.97485, p-value = 1.12e-05
Nevertheless, a general linear model of bill lengths that includes species and body mass as predictors passes a test of the normality assumption:
fit_1 = lm(bill_length_mm ~ body_mass_g + species, data = penguins)
shapiro.test(residuals(fit_1))
Shapiro-Wilk normality test
data: residuals(fit_1)
W = 0.99317, p-value = 0.123
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
“The first part of the AIC definition is a measure of goodness of fit. The second part is a penalty for the number of parameters in the model.” - Zuur (2007)
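The two parts of the definition can be verified directly in R. The sketch below uses a built-in dataset (`mtcars`, chosen only for illustration) to recompute AIC by hand:

```r
# AIC = -2 * log-likelihood (goodness of fit) + 2 * k (penalty),
# where k counts estimated parameters. For a Gaussian lm, k includes
# the residual standard deviation sigma.
fit = lm(mpg ~ wt, data = mtcars)
k = length(coef(fit)) + 1  # intercept, slope, and sigma
aic_by_hand = -2 * as.numeric(logLik(fit)) + 2 * k
all.equal(aic_by_hand, AIC(fit))  # TRUE
```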
Consider the regression equation for a simple linear regression:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
Your answer must be in plain non-technical language. Your explanation will be most effective if you use a narrative approach, using a concrete example to illustrate the concept.
Consider an experiment looking at plant biomass response to water treatments.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.4 | 2.19 | 98.371 | 0.001 |
| waterMed | 1.3 | 5.12 | 0.480 | 0.231 |
| waterHigh | 13.6 | 3.48 | 24.495 | 0.001 |
Q3 (1 pt.): Based on the model table, what is the base case water treatment?
Q4 (2 pts.): What is the average plant mass, in grams, for the low water treatment?
Q5 (2 pts.): What is the average plant mass, in grams, for the medium water treatment?
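If the logic of base cases and dummy variables is unfamiliar, the sketch below shows how R’s default treatment contrasts produce a table like the one above. The factor levels and group means here are made up for illustration; they are not the values in the question’s table:

```r
# With a factor predictor, R's default treatment contrasts pick a
# base case (the first factor level). The intercept estimates the
# base-case mean; the other coefficients estimate differences from it.
set.seed(7)
water = factor(rep(c("Low", "Med", "High"), each = 10),
               levels = c("Low", "Med", "High"))
group_means = c(Low = 2, Med = 4, High = 15)
mass = rnorm(30, mean = group_means[as.character(water)], sd = 1)
fit = lm(mass ~ water)
coef(fit)
# (Intercept) estimates the mean of the base case ("Low");
# waterMed and waterHigh estimate differences from that base case.
```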
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
No Reading Questions
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
McGarigal writes:
We expect a model with more parameters to fit better in the sense that the negative log-likelihood should be smaller if we add more terms to the model. But we also expect that adding more parameters to a model leads to increasing difficulty of interpretation.
Consider the trade-off between model complexity and interpretability.
Since your answer is targeted to a non-scientist audience, you should use narrative style using a concrete example.
Consider this table of model coefficients from a plant growth experiment with three continuous predictor variables: water, nitrogen, and phosphorus.
Note: The amount of water, N, and P given to each plant was randomized at the beginning of the experiment.
The response variable was plant biomass accumulation (in grams).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.7 | 0.23 | 98.371 | 0.061 |
| water | 0.043 | 0.001 | 0.480 | 0.021 |
| nitro | 0.192 | 0.034 | 1.495 | 0.007 |
| phosph | -0.027 | 0.014 | 0.091 | 0.721 |
Q3 (2 pts.): Using the information in the model coefficient table above, calculate the expected biomass for a plant given:
0 mL water per week
0 mg nitrogen per week
0 mg phosphorus per week
Explain how you made the calculation.
Q4 (2 pts.): Using the information in the model coefficient table above, what is the expected biomass for a plant given:
10 mL water per week
30 mg nitrogen per week
20 mg phosphorus per week
Explain how you made the calculation.
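The general recipe for this kind of calculation is: intercept plus the sum of each coefficient times its predictor value. The sketch below uses made-up coefficients, not the values from the table above:

```r
# Expected value from a coefficient table: intercept plus the sum of
# each coefficient times its predictor value. Coefficients here are
# hypothetical, for illustration only.
coefs = c(intercept = 1.0, water = 0.5, nitro = 0.2, phosph = -0.1)
vals  = c(water = 10, nitro = 30, phosph = 20)
expected_biomass = unname(coefs["intercept"] + sum(coefs[names(vals)] * vals))
expected_biomass  # 1.0 + 0.5*10 + 0.2*30 + (-0.1)*20 = 10
```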
Consider the data types/scales of the predictor and response variables.
We often present the equation for a simple linear regression model as:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
We often present the equation for a simple linear regression model as:
\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)
Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:
Topic Highlights
Tuesday
Thursday
Readings
This week’s readings are:
Please note that these lecture decks may be updated to correct spelling or other errors, fix formatting, include additional content, etc.
I will endeavor to make the slide decks available at least one week prior to the lecture session in which they will be used.
Deck 3: Data Exploration, Functions, and Associations
Deck 4: Distributions: Notation, Functions, and Probability
Deck 5: Frequentist Hypotheses and Confidence
Deck 6: Frameworks: Least Squares, Likelihood, Frequentist, Bayesian
Deck 8: Beyond The General Linear Model
Deck 9: Interactions, Dummy Variables, and Model Interpretation
Deck 10: Conditional Probability and Intro to Bayesian Perspective
- Use `\n` to insert a line break into a title or axis label.
- Use the `col` argument to specify a color for the bars.
- Use the `border` argument to specify the color of the outlines of the bars.
- Use the `adjustcolor()` function to make a color lighter by specifying an alpha value of less than 1.

# In-Class Fancy Histogram Demo
require(palmerpenguins)
hist(
penguins$bill_length_mm,
main = "Hist 'o Gram of Bill Length\nBy Mike Nelson",
col =
adjustcolor(col = "steelblue", alpha.f = .4),
border = "red",
xlab = "Bill Length (in mm)")
If a histogram has some very short bins at high values of x, you can truncate the display using the `xlim` argument. For example, if you had some data stored in a data frame and you wanted to make a histogram of the column called `wingspan`:
This doesn’t look great:
hist(
dat$wingspan,
main = "Histogram of Wingspan",
xlab = "wingspan (cm)")
I can truncate the x-values to be between 0 and 10 using `xlim`. I’ll also make more bins using the `breaks` argument.
R considers the `breaks` argument to be a suggestion… Telling it to create 30 bins won’t always result in 30 bins. You can experiment with different numbers of bins until you find a number that works for your plot.

hist(
dat$wingspan,
main = "Histogram of Wingspan",
xlab = "wingspan (cm)",
xlim = c(0, 10),
breaks = 30)
Click the links for details about each assignment.
Click the links for details about each assignment.
Click the assignment name for walkthrough and questions
You can find the final project instructions and problems here.
The final problem set consists of two parts:
Here are some supplemental links to helpful resources for various topics covered in class. I’ll continually update this list. If you find a resource that was helpful for you, let me know and I’ll add it here!
We only briefly mention the ggplot paradigm in the course, but this Complete ggplot2 Tutorial may be useful for those who want to learn more.