Welcome

This page contains information and supporting documents for the Analysis of Environmental Data lecture and lab courses at the University of Massachusetts Amherst.

Analysis of Environmental Data (ECo 602/634) is a core course for master’s and Ph.D. students in the Department of Environmental Conservation at UMass Amherst.

About the lecture

Check out the lecture syllabus here

This course provides students with an understanding of basic statistical concepts critical to the proper use and understanding of statistics in ecology and conservation science, and prepares students for subsequent ECO courses in ecological modeling. The lecture (required for all ECO master’s-level graduate students) covers foundational concepts in statistical modeling, covers basic study design concepts (with emphasis on confronting the practical issues that arise in real-world ecological study designs and statistical modeling), and lays out the ‘landscape’ of statistical methods for ecological modeling. Throughout, the emphasis is on the conceptual underpinnings of statistics rather than methodology, with a focus on defining statistical models and the major inference paradigms in use today.

About the lab

This laboratory course introduces the statistical computing language R and provides hands-on experience using R to screen and adjust data, examine deterministic functions and probability distributions, conduct classic one- and two-sample tests, utilize bootstrapping and Monte Carlo randomization procedures, and conduct stochastic simulations for ecological modeling.

Specifically, lab focuses on learning the R language and statistical computing environment, which serves as the computing platform for all ECO statistics courses; emphasis is on learning fundamental R skills that will allow students to grow and expand their expertise in subsequent courses or on their own.

Course Materials

Readings

The course draws readings from diverse texts and journal articles including:

  • Bolker, B.M. (2008). Ecological models and data in R (Princeton University Press). [Electronic version available at UMass Libraries]
  • Zuur, A.F. (2007). Analyzing ecological data (New York; London: Springer). [Electronic version available at UMass Libraries]
  • Epstein, J.M. (2008). Why Model? Journal of Artificial Societies and Social Simulation 11, 12.
  • Bang, Megan, Ananda Marin, and Douglas Medin. If Indigenous Peoples Stand with the Sciences, Will Scientists Stand with Us? Daedalus 147, no. 2 (March 1, 2018): 148–59.
  • Jorge Luis Borges: The Library of Babel

Computer Resources

Computer programs and other resources used in the course include:

DataCamp

This class is supported by DataCamp, an intuitive learning platform for data science. Learn any time, anywhere, and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 325+ courses by expert instructors on topics such as importing data, data visualization, and machine learning, and is constantly expanding its curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels.


Weekly Schedule and Reading Questions

You should complete the readings for the week before the first lecture on Tuesday. Completing the readings before class will enable you to more fully engage with the in-class activities and discussions.

Reading question sets are due the following Sunday by 11:59 PM. For example, the reading questions for week 2 are due on Sunday, Sep. 18th at 11:59 PM.

Week 1: Introductions. September 6, 8

Topic Highlights

  • Introduction to model thinking

Required Readings

This week’s readings are:

  • Epstein: Why Model?
  • McGarigal ch. 1
  • Lecture slide deck 1

Tuesday Sep 6

  • Course introduction
  • Intro to model thinking

Thursday Sep 8

Lab

  • Start labs 1 and 2

There are no week 1 reading questions

Week 2: Data and Model Thinking. September 13, 15

Topic Highlights

  • Data!
  • Sampling: populations, samples
  • Models and model thinking.
  • Preview of Frequentism. Don’t worry, we’ll revisit these concepts many times.

Readings and Questions

This week’s readings are:

  • Slide decks 1 and 2
  • McGarigal ch. 1
  • McGarigal Chapter 2: Environmental Data
  • Bolker Chapter 1: Intro and Background
    • Read sections 1.1, 1.2, 1.3
  • Bang, Marin, and Medin: If Indigenous Peoples Stand with the Sciences, Will Scientists Stand with Us?

Tuesday

  • Finish deck 1, start deck 2

Thursday

Lab

  • Continue Labs 1 and 2

Week 2 Questions

Q1: Dichotomies

This question draws mostly upon materials from the Bolker reading: Chapter 1, sections 1.1 - 1.3, and the ideas in the Model Thinking lecture and notes.

  • Many of the terms in the table will be unfamiliar. Please don’t let that discourage you!
  • The Bolker and Zuur texts are classics, but they are both very dense and difficult reads (even for me).

Choose one of the modeling dichotomies that Bolker writes about in sections 1.1 - 1.3 (summarized in table 1.1 on page 6).

  • Q1 (2 pts.): In 1 - 2 short paragraphs, explain the dichotomy in your own words and briefly describe how you might approach one of your research interests from each of the dichotomy endpoints.

Q2: Assumptions and Biases

This question draws mostly upon materials from the McGarigal chapter 1 slides.

  • Concepts in the Bang et al. paper may also provide insight.

In common language, the terms assumption and bias usually describe negative, value-laden concepts that we should avoid.

Our usage of these terms is slightly different. In this course we’re more interested in identifying important (and sometimes hidden) assumptions and biases. Sometimes we gain important insights simply by identifying assumptions and biases in our modeling process.

Just as with uncertainty, we can improve our ability to detect hidden assumptions and biases, and thereby become more informed modelers.

McGarigal states in chapter 1:

“Western science and our society requires that challenges to the status quo be empirically and rigorously demonstrated (analogy: ‘innocent until proven guilty’).”

“… This is because we live in a world where challenges to the status quo are given little credence without solid evidence for the alternative (analogy: innocent until proven guilty).”

“Whether we are presenting our findings to a scientific audience (e.g., in a scientific journal) or to managers, policy-makers, or the general public, we are increasingly asked to defend our conclusions on the basis of statistical evidence.”

  • These seem like great ideals to strive for, right? But are there any biases in these statements?

Part of being a good modeler is identifying biases: implicit, cultural, scientific, etc.

Often these biases come from implicit assumptions that we don’t even know we’re making!

Consider some potential assumptions and/or biases in the above quotes, and in the description of the four testimonials regarding climate change and bird nesting habitat.

  • Q2 (2 pts.): Identify at least one source of bias or assumption (cultural, scientific, or other). What practical impact might these biases or assumptions have on scientific communication and the effectiveness of management efforts? (1 - 3 paragraphs)

Q3: Dual Model Paradigm

This question draws mostly upon materials from the McGarigal Chapter 1 and Bolker Chapter 1 readings.

  • Q3 (2 pts.): In 1 - 2 short paragraphs, describe the following:
    • Identify and briefly define the two primary components of a model constructed in the dual model paradigm.
    • Give an example of the two components in the context of a system you are interested in studying.

Q4: Populations

This question draws mostly upon materials from the McGarigal chapter 2 slides.

  • Q4 (2 pts.): In 1 - 2 short paragraphs, describe the difference between a statistical population and a biological or ecological population.
    • Which of these populations may vary depending on the spatial or temporal scale of the research question?

Q5: Model Thinking

This question is related to the McGarigal chapter 2 slides and the in-class group model thinking activity.

Consider the scenario your group chose to use in the model thinking in-class activity:

  • Cascades snow pack
  • White pine blister rust
  • Cattails

Choose 2 of the following data types and scales:

  1. A continuous variable on a ratio scale
  2. A categorical, nominal variable
  3. A discrete variable
  4. A numerical variable on an interval scale

For each of your chosen variable type/scale types:

  • Propose an entity and/or variable in your scenario that you could measure using the data type/scale.
  • Explain why the data type or scale is appropriate for the entity/variable you chose.

For example, if I were studying herbivory by Monarch caterpillars on different species of milkweeds (Asclepias spp.), I might measure the host plant species on a categorical/nominal scale.

I could measure the amount of herbivory on a ratio scale because a value of zero herbivory is meaningful.
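
To make this concrete in R, the data types above map naturally onto vector classes. A minimal sketch, using entirely hypothetical milkweed/herbivory values:

```r
# Hypothetical measurements, for illustration only.

# Categorical/nominal: host plant species, stored as a factor.
host_species = factor(c("A. syriaca", "A. incarnata", "A. syriaca"))

# Ratio scale: percent leaf area eaten. Zero is meaningful (no herbivory),
# and 20% is twice as much herbivory as 10%.
herbivory_pct = c(0, 12.5, 20)

# Discrete: counts of caterpillars per plant.
n_caterpillars = c(0L, 2L, 1L)

levels(host_species)
```

Storing nominal data as a factor (rather than plain character strings) is what lets R treat the variable correctly in plots and models later in the course.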

Remember, you only need to choose 2 of the 4 variable types/scales.

  • Q5 (2 pts.): For each of your two chosen variables: describe your proposed entity or variable and explain why your chosen data type/scale is appropriate.

Report

Save your answers in a pdf document and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.


Week 3: Data Exploration. September 20, 22

Topic Highlights

  • Data exploration
    • Graphical and numerical
  • Plot types
  • Data in R: Import, types, data structures

Tuesday

Thursday

Make sure to read the in-class assignment directions before you come to class!

Lab

  • Start lab 3

Readings and Questions

This week’s readings are:

  • McGarigal Chapter 3: Exploration, sections 1 - 5
  • Zuur Chapter 4: Exploration, section 4.1.
    • You don’t need to read about the last two plot types: lattice graphs and design/interaction plots
  • Bolker Chapter 2: Data Analysis and Graphics
    • Read Sections 2.1 through 2.3. There are a lot of helpful R examples and tips.
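
If you want to experiment as you read, each of the plot types discussed in the readings has a base-R counterpart. A quick sketch using the built-in iris data (no packages required):

```r
# Exploratory plot types from the readings, using built-in data.
hist(iris$Sepal.Length)                        # histogram
plot(Sepal.Width ~ Sepal.Length, data = iris)  # scatterplot
dotchart(iris$Sepal.Length)                    # Cleveland dotplot
boxplot(Sepal.Length ~ Species, data = iris)   # boxplot
coplot(Sepal.Width ~ Sepal.Length | Species,   # coplot, conditioned on
       data = iris)                            # a third variable
```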

Week 3 Questions

Q1: Plots

Consider the following types of plots described in the McGarigal and Zuur readings:

  • Histogram
  • Scatterplot
  • Cleveland dotplot
  • Boxplot
  • coplot
  • Q1 (1 pt.): Which of the plot types show every data point?

Q2: Plots

Consider the following types of plots described in the McGarigal and Zuur readings:

  • Histogram
  • Scatterplot
  • Cleveland dotplot
  • Boxplot
  • coplot
  • Q2 (1 pt.): Which of the plot types show aggregated or summarized data?

Q3: Conditioning Variables

Conditional plot, conditioning variable, and related terms occurred throughout the Zuur and McGarigal readings.

  • Q3 (3 pts.): Explain what a conditioning variable means in the context of graphical data exploration.

Dispersion: Q4 - Q5

  • Q4 (1 pt.): List at least three of the common measures of spread or dispersion that were mentioned in the readings.
  • Q5 (2 pts.): Choose two of the measures in your list and explain how they capture different aspects of the concept of spread.

Q6: Data Exploration

Consider a dataset that you have collected or worked with.

If you haven’t worked much with existing datasets, hypothesize a dataset that you might collect for your research.

  • Q6 (5 pts.): List two of the important reasons to perform data exploration (numerical and/or graphical).
    • For each of the two reasons you identify, describe the quantities or plots you would use and the insight you would gain.

Report:

Save your answers in a pdf document and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 4: Functions. September 27, 29

Topic Highlights

  • Deterministic Functions
  • Classes of Functions
  • Function Intuition

Tuesday

Thursday

Lab

  • Start Lab 4
  • Continue Lab 3

Readings and Questions

This week’s readings are:

  • McGarigal Chapter 4: Deterministic Functions
  • Bolker Chapter 3: Deterministic Functions for Ecological Modeling
    • Read the chapter introduction (Section 3.1)

Week 4 Questions

Q1: Predictors

McGarigal presented two studies of Brown creepers:

  1. A model of Brown creeper abundance explained by late-successional forest percent.
  2. A model of Brown creeper presence/absence explained by total basal area (a measure of tree cover).

For each model:

  • Consider what types of data were collected in each study:
    • Are they continuous, discrete, categorical?
    • What is the data scale?
  • What kind of deterministic function is used?
  • Q1 (2 pts.): For both models (abundance and presence/absence) identify:
    1. The predictor variable(s).
    2. The data type/scale used for the predictor variable.

Q2: Responses

McGarigal presented two studies of Brown creepers:

  1. A model of Brown creeper abundance explained by late-successional forest percent.
  2. A model of Brown creeper presence/absence explained by total basal area (a measure of tree cover).

For each model:

  • Consider what types of data were collected in each study:
    • Are they continuous, discrete, categorical?
    • What is the data scale?
  • What kind of deterministic function is used?
  • Q2 (2 pts.): For both models (abundance and presence/absence) identify:
    1. The response variable.
    2. The data type/scale used for the response variable.

Q3: Model Constraints

McGarigal presented two studies of Brown creepers:

  1. A model of Brown creeper abundance explained by late-successional forest percent.
  2. A model of Brown creeper presence/absence explained by total basal area (a measure of tree cover).

For each model:

  • Consider what types of data were collected in each study:
    • Are they continuous, discrete, categorical?
    • What is the data scale?
  • What kind of deterministic function is used?
  • Q3 (4 pts.): For both models: How did the data type or scale influence or constrain the choice of model?

Q4: Predator-Prey Model

McGarigal presented a simulated example of density-dependent predator-prey interactions in which he fit several different models to the data.

Consider only the Ricker and quadratic models.

Some concepts to keep in mind:

  • mechanistic vs. phenomenological
  • goodness-of-fit
  • previous knowledge of predator-prey interactions
  • Q4 (1 pt.): What are the pros and cons of the Ricker model? What are the pros and cons of the quadratic model?
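
To build some visual intuition before answering, you can plot the two functional forms side by side. The parameter values below are made up for illustration:

```r
# Compare the shapes of a Ricker curve and a quadratic.
ricker = function(x, a, b) a * x * exp(-b * x)
quadratic = function(x, c0, c1, c2) c0 + c1 * x + c2 * x^2

x = seq(0, 10, length.out = 200)
plot(x, ricker(x, a = 2, b = 0.5), type = "l",
     ylab = "f(x)", main = "Ricker vs. quadratic")
lines(x, quadratic(x, c0 = 0, c1 = 1.5, c2 = -0.15), lty = 2)
legend("topright", legend = c("Ricker", "quadratic"), lty = c(1, 2))
```

Try changing the parameters and watching how each curve behaves at large values of x.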

Report

Save your answers in a pdf document and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 5: Probability Distributions. October 4, 6

Topic Highlights

Readings

  • Bolker ch. 1: Intro and Background
    • Read section 1.6: Outline of the Modeling Process
  • McGarigal 5: Probability Distributions
  • Jorge Luis Borges: La Biblioteca de Babel (The Library of Babel)
    • Don’t worry, you can read the English translation!
  • Optional Bolker chapter 4: Probability and Stochastic Distributions for Ecological Modeling
    • This is a much more in-depth and technical overview than we will be covering in class.
    • This reading is totally optional, I’ve provided it as a reference only in case you’re interested.

Tuesday

Thursday

Lab

Week 5 Questions

Warm-Up Questions: Q1 - Q7

  • Q1 (2 pts.): Choose the best words or phrases to fill in the blanks: A probability distribution is a map from the (a)_____ to the (b)_____.

  • Q2 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two coins sequentially: a penny and a quarter? Assume that

    • the two coins each have a head and a tail
    • you care about order, but you may flip either coin first.
    • the probability of heads or tails is about 0.5 for each coin.

  • Q3 (2 pts.): How many possible outcomes are there (i.e. what is the sample space) if you flip two quarters at the same time? Assume that

    • the two coins are indistinguishable
      • i.e. you just want to know the number of heads or tails for each possible outcome.
    • each have a head and a tail
    • the probability of heads or tails is about 0.5 for each quarter.

  • Q4 (2 pts.): Suppose you flip a penny three times. If you care about the order of flips, how many possible events are there in the sample space?

  • Q5 (1 pt.): Are the events in the previous question combinations, or permutations?

  • Q6 (2 pts.): Now suppose you don’t care about the order, and you simply want to know about the number of heads when you flip the penny three times. How many possible events are in the sample space?

  • Q7 (1 pt.): Are the events in the previous question combinations, or permutations?

Simultaneous Acorns 1: Q8 - Q10

A sample space is the set of all possible outcomes of a random process.

Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve just started to drop their acorns.

Without looking, you reach down and pick up two acorns in one hand at the same time and shuffle them around before you look.

Describe the sample space of your collection (i.e. enumerate the set of all possible outcomes).

Some things to consider when describing your sample space:

  • Assume that two acorns of the same species are indistinguishable.
  • In your 2-acorn draw, what is an event?
  • How many elements are in each possible event?
  • Does the order or arrangement of acorns matter?
  • Q8 (2 pts.): What is the size of the sample space?
  • Q9 (2 pts.): Given the scenario description, how many ways are there to collect two acorns of the same species?
  • Q10 (2 pts.): Given the scenario description, how many ways can you collect two acorns of different species?

Sequential Acorns Q11 - 16

A sample space is the set of all possible outcomes of a random process.

Suppose it is a beautiful fall day and you are sitting underneath three oak trees: Bur oak (Quercus macrocarpa), Northern Red Oak (Q. rubra), and White oak (Q. alba). They’ve dropped most of their acorns. It was a productive year, so there seem to be thousands of acorns from each species!

  • There are approximately the same number of acorns from each species on the ground, and they seem to be evenly spread around.

You collect an acorn, place it in your left pocket, walk a short distance and collect a second acorn placing it in your right pocket.

Some things to consider when describing your sample space:

  • Assume that two acorns of the same species are indistinguishable.
  • In your 2-acorn draw, what is an event?
  • How many elements are in each possible event?
  • Does the order or arrangement of acorns matter?
  • Q11 (1 pt.): What is the probability that the acorn in your left pocket is Q. alba?
  • Q12 (1 pt.): What is the probability that the acorn in your right pocket is Q. macrocarpa?
  • Q13 (2 pts.): If you already know that the acorn in your left pocket is Q. alba, what is the probability that the acorn in your right pocket is also Q. alba?
  • Q14 (2 pts.): What is the probability that both acorns are Q. rubra?
  • Q15 (2 pts.): What is the probability that you collected exactly one each of Q. alba and Q. rubra?
  • Q16 (2 pts.): What is the probability that the acorn in your left pocket is Q. alba and you have an acorn of Q. rubra in your right pocket?

Binomial and Poisson: Q17 - Q20

For the questions below consider two discrete probability distributions, parameterized as:

  • a Poisson distribution with \(\lambda = 6\)
  • a Binomial distribution with \(n = 10\) and \(p = 0.6\).
  • Q17 (1 pt.): Which of the following is the size of the sample space of this Poisson distribution?

    • 0
    • 2
    • 6
    • 10
    • 11
    • \(\infty\)
  • Q18 (2 pts.): Which of the following is the size of the sample space of this Binomial distribution?

    • 0
    • 2
    • 6
    • 10
    • 11
    • \(\infty\)
  • Q19 (2 pts.): Describe a characteristic that is common to both the Binomial and Poisson distributions that makes them good models for counts.

  • Q20 (2 pts.): Hypothesize a scenario in which a Binomial distribution may be a better count model than a Poisson distribution.
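
R’s built-in d-functions let you inspect both probability mass functions directly, which may help with your intuition for the questions above:

```r
# Probability mass functions for the two distributions above:
# Poisson(lambda = 6) and Binomial(n = 10, p = 0.6).
k = 0:15
p_pois = dpois(k, lambda = 6)
p_binom = dbinom(k, size = 10, prob = 0.6)

# Plot the two PMFs side by side for comparison.
plot(k, p_pois, type = "h", lwd = 2, ylab = "probability")
points(k + 0.2, p_binom, type = "h", lwd = 2, col = 2)
```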

Report

Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 6: Frameworks. October 11, 13

Topic Highlights

  • Distribution Functions
  • Continuous Distributions
  • The Frequentist statistical perspective
    • Brief intro to other paradigms
  • Confidence and significance

Readings

  • Bolker ch 1: Introduction and Background
    • Read section 1.4: Frameworks for Statistical Inference.
    • Focus your attention on 1.4.1
  • McGarigal Ch 6a, 6b, 6c

Tuesday

Thursday

Lab

Week 6 Questions

Null Hypothesis: Q1

The Bolker reading was difficult….

Bolker used a seed predation experiment to illustrate the statistical frameworks.

The primary question in his examples is: Do seed predation rates vary among species?

Reminder: in the Frequentist paradigm, a null hypothesis can be used as a baseline against which you can compare your observations.

  • Q1 (3 pts.): In a short paragraph, describe a baseline scenario regarding seed predation. At the end, state the null hypothesis for seed predation.

Seed Predation Rates: Q2

I have found that recreating the calculations in a difficult reading helps me understand and follow it.

In that spirit, you’ll use data presented in Bolker Table 1.2 (on the top of page 11) to calculate the seed predation rates.

Here’s some template R code you can use to get started:

# Clear your R environment to make 
# sure there are no stray variables.

rm(list = ls())

pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = ????
pol_predation_rate = ????

psd_n_predation = ????
psd_n_no_predation = ????
psd_n_total = ????
psd_predation_rate = ????

Self-test: Run the following code after you have made your calculations. Your rates should match the observed proportions in the Bolker text on page 11.

print(
  paste0(
    "The seed predation rate for Polyscias fulva is: ",
    round(pol_predation_rate, digits = 3))) 

print(
  paste0(
    "The seed predation rate for Pseudospondias microcarpa is: ",
    round(psd_predation_rate, digits = 3)))
pol_n_predation = 26
pol_n_no_predation = 184
pol_n_total = 210
pol_predation_rate = pol_n_predation/pol_n_total
psd_n_predation = 25
psd_n_no_predation = 706
psd_n_total = 731
psd_predation_rate = psd_n_predation/psd_n_total

predation_ratio = pol_predation_rate/psd_predation_rate
round(predation_ratio, digits = 3)
## [1] 3.62

Hint: To make sure that your code is written correctly, you need to run it in an empty R environment.

The call rm(list = ls()) on the first line of the code template removes all variables from the environment. Make sure you include it in your code!

  • Q2 (3 pts.): Paste the R code you used to complete the table and calculate the rates.

Seed Predation Table: Q3

Create a table and fill in the missing values:

| species                         | Any taken | None taken | N  | Predation rate |
|---------------------------------|-----------|------------|----|----------------|
| Polyscias fulva (pol)           | 26        | 184        | __ | __             |
| Pseudospondias microcarpa (psd) | __        | __         | __ | __             |

  • Q3 (3 pts.): Show your table with the missing values filled in.

Seed Predation Ratio: Q4

Use the seed predation proportions you calculated to determine the ratio of seed predation proportions.

Things to consider:

  • Which rate should be in the denominator?
  • Predation proportions (predation rates) are different than odds ratios.
  • Q4 (2 pts.): Report the ratio of seed predation proportions and show the R code you used to do the calculation.

Report

Save your answers in a pdf document (or a knitted html document) and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 7: Confidence and Sampling Distributions. October 18, 20

Topic Highlights

  • Confidence Intervals
  • Frequentist Statistical Significance
  • Sampling Distribution

Tuesday

Thursday

Readings

This week’s readings are:

  • McGarigal 6c: Confidence Interval Primer

A few things to remember about Frequentist Confidence Intervals:

The width of confidence intervals is influenced by properties of both the population and the sampling process.

Recall that we are not 95% certain that a 95% confidence interval we calculate contains the true value.
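
One way to internalize what the 95% actually refers to is to simulate repeated sampling from a known population. A minimal sketch (the population values are made up):

```r
# Draw many samples from a known normal population and record how
# often the usual t-based 95% CI contains the true mean.
set.seed(42)
mu = 10; sigma = 3; n = 20

contains_mu = replicate(1000, {
  x = rnorm(n, mean = mu, sd = sigma)
  ci = t.test(x)$conf.int
  ci[1] <= mu & mu <= ci[2]
})

mean(contains_mu)  # close to 0.95 over many repeated samples
```

The 95% describes the long-run behavior of the interval-building procedure, not our certainty about any single interval.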

Questions

For Questions 1 - 4, assume you are working with a population that is normally-distributed with mean \(\mu\) and standard deviation \(\sigma\). Note that although these population parameters exist, you cannot know their exact values and you must estimate them through sampling.

  • Q1 (1 pt.): Explain the effect, if any, of the population mean on the width of CIs for a population that is normally-distributed. If population mean does not affect the widths of CIs explain why not.
  • Q2 (1 pt.): Explain the effect, if any, of the population standard deviation on the width of CIs. If population standard deviation does not affect the widths of CIs explain why not.
  • Q3 (1 pt.): Explain the effect, if any, of the population size on the width of CIs. If population size does not affect the widths of CIs explain why not.
  • Q4 (1 pt.): Explain the effect, if any, of the sample size on the width of CIs. If sample size does not affect the widths of CIs explain why not.
  • Q5 (4 pts.): Interpreting a CI. Use a narrative example of a real (or made up) dataset to describe what a Frequentist 95% confidence interval really means.
    • Make sure you cover any relevant assumptions of the Frequentist paradigm.
    • Your answer must be in non-technical language.
    • Imagine you were explaining confidence intervals to an audience of teenagers, or perhaps a family member who doesn’t have training in statistics.

Your explanation will be more successful if you use an example or describe your answer in the context of a real-life scenario rather than a purely theoretical explanation.

Report

Save your answers in a pdf document and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 8: Frameworks. October 25, 27

Topic Highlights

  • Inference
  • Measures of Fit
  • Likelihood

Tuesday

Thursday

  • Start deck 7 - through t-tests (fingers crossed)
  • Finish up stray in-class stuff.

Readings

This week’s readings are:

  • Slide deck 6
  • Slide deck 7 (through t-tests)
  • McGarigal Ch 7: Nonparametric Inference: Ordinary Least Squares and More
  • McGarigal Ch 8: Maximum Likelihood Inference
  • Optional: Skim McGarigal Ch 8: Bayesian Inference

Q1: Parametric/Non-Parametric

Refer back to sections 7.1 and 8.2 for McGarigal’s descriptions of the form of the linear statistical model for the nonparametric OLS and parametric likelihood-based inference techniques.

  • Recall that he used the same data to illustrate both paradigms: Brown creeper abundance (response) and proportion of late successional forest (predictor).

Note: McGarigal specifies the parametric model using this notation:

\(Y \sim Normal(a + bx, \sigma)\)

However, both the parametric and nonparametric models can be expressed in the more familiar regression model format:

\(y_i = \beta_0 + \beta_1 x_i + e_i\)

  • Q1 (1 pt.): Describe the key difference between the nonparametric model (Ch. 7.1) and the parametric model (Ch. 8.1).

Q2: Interpolation/Extrapolation

Interpolation and extrapolation may both be used to make predictions.

  • Q2 (1 pt.): What is the difference between interpolation and extrapolation?

Q3: Interpolation/Extrapolation

  • Q3 (1 pt.): Explain why extrapolation has more pitfalls than interpolation.
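
As an illustration with simulated data: predict() will happily return values far outside the range of the original observations, and whether those predictions mean anything is another matter.

```r
# Fit a simple regression with x between 0 and 10, then predict
# inside (interpolation) and far outside (extrapolation) that range.
set.seed(1)
x = runif(50, min = 0, max = 10)
y = 2 + 0.5 * x + rnorm(50, sd = 1)
fit = lm(y ~ x)

predict(fit, newdata = data.frame(x = 5))    # interpolation
predict(fit, newdata = data.frame(x = 100))  # extrapolation
```

Nothing in the data can tell us whether the straight-line relationship still holds at x = 100.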

Report

Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 9: Frequentist Linear Models. November 1, 3

Topic Highlights

  • General Linear Models
  • Assumptions
  • Intro to the Constellation

Tuesday

  • Continue deck 7

Thursday

Readings

This week’s readings are:

  • Slide Deck 7
  • McGarigal Ch. 11a: Landscape of Statistical Methods: Part 1
    • Read sections 1 and 2 (we’ll come back to the rest later)
  • Bolker Ch. 9: Standard Statistics Revisited
    • Read sections 9.1 - 9.2
  • Zuur Ch. 5: Linear Regression
    • Read sections 5.1 - 5.2

Q1: Modeling Approach

“In the best case, your data will match a classical technique like linear regression exactly, and the answers provided by classical statistical models will agree with the results from your likelihood model.” - Bolker (2008)

Bolker describes custom-made analyses based on Maximum Likelihood, which often have a biological, ecological, or mechanistic justification.

He contrasts these with the familiar, canned Least Squares methods that we typically learn in our first statistics course.

  • Q1 (1 pt.): Briefly (1 - 2 short paragraphs) describe at least two tradeoffs between the customized ML methods and the canned methods.

Q2: Assumptions

  • Q2 (1 pt.): Briefly (1 - 2 sentences) describe each of the four key assumptions of the general linear modeling approach.

Q3: Normality

“The normality assumption means that if we repeat the sampling many times under the same environmental conditions, the observations will be normally distributed for each value of X.” - Zuur (2007)

A common misconception about this assumption is that the values of the response variable must be normally distributed.

Consider this histogram of penguin bill lengths:

Bill lengths appear very non-normal.

The very low p-value in the Shapiro test of normality provides strong evidence against the null hypothesis that bill lengths are normally distributed.

# The penguins data are assumed to come from the palmerpenguins package.
library(palmerpenguins)

shapiro.test(penguins$bill_length_mm)

    Shapiro-Wilk normality test

data:  penguins$bill_length_mm
W = 0.97485, p-value = 1.12e-05

Nevertheless, a general linear model of bill lengths that includes species and body mass as predictors passes a test of the normality assumption:

fit_1 = lm(bill_length_mm ~ body_mass_g + species, data = penguins)
shapiro.test(residuals(fit_1))

    Shapiro-Wilk normality test

data:  residuals(fit_1)
W = 0.99317, p-value = 0.123

  • Q3 (1 pt.): Explain how the normality assumption can be met in a general linear model, even if the response variable is not normally-distributed. (1 - 2 paragraphs)

Report

Save your answers in a pdf document, or a rendered html page, and upload to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 10: Frequentist Linear Models. November 8, 10

Topic Highlights

  • Model interpretation
  • Model Validation
  • Model comparison and selection
  • Prep for Ginkgoes!

Tuesday

Thursday

  • Ginkgo leaf collection and quantification

Readings

This week’s readings are:

  • Review Deck 7

Q1: Model Selection

“The first part of the AIC definition is a measure of goodness of fit. The second part is a penalty for the number of parameters in the model.” - Zuur (2007)

  • Q1 (1 pt.): Why would we want a model selection criterion to penalize the number of parameters in a model?
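As context for this question, AIC can be computed directly in R with the `AIC()` function. The sketch below uses simulated data (made up for illustration, not part of the assignment) to show that adding a useless parameter barely improves the fit but always increases the penalty:

```r
# Illustrative sketch: AIC rewards goodness of fit but penalizes parameters.
set.seed(1)
x1 = rnorm(100)
x2 = rnorm(100)              # x2 is pure noise, unrelated to y
y  = 2 + 3 * x1 + rnorm(100)

fit_good  = lm(y ~ x1)       # the true model
fit_extra = lm(y ~ x1 + x2)  # adds a useless parameter

# AIC = -2 * log-likelihood + 2 * k, where k counts estimated parameters.
# The extra term can only lower the negative log-likelihood slightly,
# but it always adds 2 to the penalty term.
AIC(fit_good, fit_extra)
```

Try re-running with different seeds: the extra-parameter model rarely wins, and when it does it is by a small margin.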

Q2: Interpreting a Slope

Consider the regression equation for a simple linear regression:

\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)

  • Q2 (3 pts.): In 2 - 3 short paragraphs, describe the meaning of the slope parameter \(\beta_1\) in the context of the relationship between the predictor variable, x, and the response variable y.

Your answer must be in plain non-technical language. Your explanation will be most effective if you use a narrative approach, using a concrete example to illustrate the concept.

Interpreting a Coefficient Table 1: Q3 - Q5

Consider an experiment looking at plant biomass response to water treatments.

  • The three water level treatments are: “low”, “med”, and “high”.
  • The response is plant biomass in grams after 7 weeks of growth.
              Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)      2.4        2.19   98.371     0.001
  waterMed         1.3        5.12    0.480     0.231
  waterHigh       13.6        3.48   24.495     0.001
  • Q3 (1 pt.): Based on the model table, what is the base case water treatment?

  • Q4 (2 pts.): What is the average plant mass, in grams, for the low water treatment?

    • How did you calculate this quantity?
  • Q5 (2 pts.): What is the average plant mass, in grams, for the medium water treatment?

    • How did you calculate this quantity?
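As background for reading tables like this one, here is a minimal, hypothetical sketch with simulated data (the treatment means below are made up and unrelated to the table above). R treats the first factor level as the base case: the intercept estimates the base-case mean, and the remaining coefficients are offsets from it.

```r
# Hypothetical sketch (simulated data, NOT the experiment in the question):
# how R encodes a factor predictor in a coefficient table.
water = factor(rep(c("low", "med", "high"), each = 10),
               levels = c("low", "med", "high"))
set.seed(123)
mass = rnorm(30, mean = rep(c(2, 4, 16), each = 10))

fit = lm(mass ~ water)
coef(fit)
# (Intercept) estimates the mean of the base case (the first factor level);
# the other coefficients are differences from that base-case mean.
```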

Q6: Coefficient Interpretation

  • Q6 (1 pt.): Which of the following questions cannot be addressed with the model coefficient table? Select the correct answer or answers:
  1. Is there a positive relationship between increased water availability and plant biomass accumulation?
  2. Is water availability a significant predictor for plant biomass accumulation?
  3. What is the average biomass of plants in the high water treatment?

Report

Save your answers in a pdf document or a rendered html page, and upload it to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 11: Nov 15, 17

Topic Highlights

  • NA

Tuesday

Thursday

  • Intro to Final Projects
  • Finish Deck 9

Readings

This week’s readings are:

  • Review the readings for General Linear Models
  • Slide Deck 9

No Reading Questions

Week 12: Frequentist Linear Models and Beyond November 29, December 1

Topic Highlights

  • Recap of General Linear Models
  • Bayesian Perspective
  • Statistical Power
  • Logistic Regression
  • Your Review Topics

Tuesday

  • Start Deck 10

Thursday

  • Finish Deck 10
  • Start Deck 11

Readings

This week’s readings are:

  • Review Zuur 5.1 and 5.2
  • Slide Deck 10
  • Slide Deck 11

Q1: Model Comparison

McGarigal writes:

We expect a model with more parameters to fit better in the sense that the negative log-likelihood should be smaller if we add more terms to the model. But we also expect that adding more parameters to a model leads to increasing difficulty of interpretation.

  • Q1 (2 pts.): In the context of a dataset (real or made up), describe the inherent conflict between using a complicated model that minimizes the unexplained variation and using a simple model that is easy to communicate.

Consider the trade-off between model complexity and interpretability.

Since your answer is targeted to a non-scientist audience, you should use a narrative style with a concrete example.

Interpreting a Coefficient Table 2: Q2 - Q4

Consider this table of model coefficients from a plant growth experiment with the following continuous predictor variables.

Note: The amount of water, N, and P given to each plant was randomized at the beginning of the experiment.

  • Water: 3 - 30 mL per week
  • Nitrogen: 1.1 - 42 mg per week
  • Phosphorus: 1.1 - 42 mg per week

The response variable was plant biomass accumulation (in grams).

              Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)   -1.700       0.230   98.371     0.061
  water          0.043       0.001    0.480     0.021
  nitro          0.192       0.034    1.495     0.007
  phosph        -0.027       0.014    0.091     0.721
  • Q2 (1 pt.): Which of the following predictor variables had slope coefficients that were significantly different from zero at a 95% confidence level? Select the correct answer(s):
  1. water
  2. nitrogen
  3. phosphorus
  4. None
  • Q3 (2 pts.): Using the information in the model coefficient table above, calculate the expected biomass for a plant given:

  • 0 mL water per week

  • 0 mg nitrogen per week

  • 0 mg phosphorus per week

Explain how you made the calculation.

  • Q4 (2 pts.): Using the information in the model coefficient table above, what is the expected biomass for a plant given:

  • 10 mL water per week

  • 30 mg nitrogen per week

  • 20 mg phosphorus per week

Explain how you made the calculation.
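For reference, the expected value from a general linear model with continuous predictors is the intercept plus each coefficient times its predictor value. This hypothetical sketch (simulated data, unrelated to the table above) shows the hand calculation agreeing with R's `predict()` function:

```r
# Hypothetical sketch (simulated data, NOT the table above): computing an
# expected response from a fitted model's coefficients.
set.seed(7)
water = runif(50, 3, 30)
nitro = runif(50, 1.1, 42)
mass  = 1 + 0.1 * water + 0.05 * nitro + rnorm(50)

fit = lm(mass ~ water + nitro)

# By hand: intercept + each coefficient times its predictor value
by_hand = coef(fit)[1] + coef(fit)[2] * 10 + coef(fit)[3] * 30

# Same result via predict()
newdata = data.frame(water = 10, nitro = 30)
all.equal(unname(by_hand), unname(predict(fit, newdata)))
```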

Q5: Regression and ANOVA

  • Q5 (1 pt.): Describe the key difference between a simple linear regression and a 1-way analysis of variance.

Consider the data types/scales of the predictor and response variables.

Q6: Regression Equation as a Dual Model

We often present the equation for a simple linear regression model as:

\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)

  • Q6 (1 pt.): Identify the deterministic component(s) of the model equation.

Q7: Regression Equation as a Dual Model

We often present the equation for a simple linear regression model as:

\(y_i = \alpha + \beta_1 x_i + \epsilon_i\)

  • Q7 (1 pt.): Identify the stochastic component(s) of the model equation.

Report

Save your answers in a pdf document or a rendered html page, and upload it to Moodle. Make sure you include:

  • Your name
  • The names of the students you worked on the questions with. You should also indicate if you didn’t work with any other students.

Week 13: Constellation of Methods, Course Recap, Simulation, and beyond December 6, 8

Topic Highlights

  • Course Topic Recap
  • Your Review Topics
  • Generalized Linear Models

Tuesday

  • Continue Deck 11

Thursday

  • Finish Deck 11

Readings

This week’s readings are:

Lecture Notes

Please note that these lecture decks may be updated to correct spelling or other errors, fix formatting, include additional content, etc.

I will endeavor to make the slide decks available at least one week prior to the lecture session in which they will be used.

Lab Notes

In-Class R Demos and Examples

Lab 1 Demos

Lab 1, Sep 1

Fancy Histograms

  • You can use the special character \n to insert a line break into a title or axis label.
  • You can use the col argument to specify a color for the bars.
  • You can use the border argument to specify the color of the outlines of the bars.
  • You can use the adjustcolor() function to make a color lighter by specifying an alpha value of less than 1.
# In-Class Fancy Histogram Demo

require(palmerpenguins)

hist(
  penguins$bill_length_mm,
  main = "Hist 'o Gram of Bill Length\nBy Mike Nelson",
  col = 
    adjustcolor(col = "steelblue", alpha.f = .4),
  border = "red",
  xlab = "Bill Length (in mm)")

Histogram x-axis limits

If a histogram has some very short bins at high values of x, you can truncate the display using the xlim argument. For example, if you had some data stored in a data frame and you wanted to make a histogram of the column called wingspan:

This doesn’t look great:

hist(
  dat$wingspan,
  main = "Histogram of Wingspan",
  xlab = "wingspan (cm)")

I can truncate the x-values to be between 0 and 10 using xlim.

I’ll also make more bins using the breaks argument.

  • Note that R considers the breaks argument to be a suggestion: telling it to create 30 bins won’t always result in exactly 30 bins. You can experiment with different numbers of bins until you find one that works for your plot.
hist(
  dat$wingspan,
  main = "Histogram of Wingspan",
  xlab = "wingspan (cm)",
  xlim = c(0, 10),
  breaks = 30)

Lecture Assignments - ECO 602

Click the links for details about each assignment.

Final Projects/Take-Home Questions

You can find the final project instructions and problems here.

The final problem set consists of two parts:

  • A comprehensive R guide.
  • A complete data analysis.

Tips, Tricks, and Walkthroughs

Here are some supplemental links to helpful resources for various topics covered in class. I’ll continually update this list. If you find a resource that was helpful for you, let me know and I’ll add it here!