Learning Objectives

  • Create a set of R examples for your personal reference beyond this course.
  • Use the concepts you have learned in the course to perform a complete data analysis.

Introduction: R Reference

It’s very easy to forget R programming concepts if you don’t use them frequently.

A great way to retain the skills that will be important for you going forward is to create a set of examples using the functions you want to remember.

You’ll create an R Markdown document containing examples of how to use the basic R components that you need to know.

Introduction: Data Analysis

For this portion of the final project, you’ll perform a data analysis on data collected on two species of small mammals in the Atlantic Forest of Brazil.

You’ll find the data file, delomys.csv, in the data tab of the course GitHub page.

It includes data extracted from a larger data set:

R Markdown Documents

You’ll create a new RMarkdown document for your R reference guide:

  • final_R_reference.Rmd

It needs to be in the docs subfolder (along with your index.Rmd file).

Part 1: R Reference Guide

Document Formatting

You need to create a document that uses tabs to organize the content. Here’s a template:

# R Reference Guide {.tabset .tabset-pills}


## Loading Data and Packages


## Next Section...


...

Creating a successful code example

Self-Contained, concise, reproducible examples

A successful code example should communicate what the function or operator does.

The best examples are:

  • Self-contained:
    • If possible, the example does not rely on external data sets.
    • To the greatest extent possible, the example does not rely on external packages.
      • This isn’t always possible. It’s ok to include external packages if they contain essential data or supporting functions!
  • Concise:
    • The example is as simple and brief as possible.
    • The example focuses on the function or operator in question, avoiding other functions as much as possible.
    • Avoid jargon and technical terminology as much as possible.
  • Reproducible:
    • The example must run on anyone’s computer!

Your code examples need to be written in your own words, in such a way that you will be able to decipher them later.

Code Comments

You should use comments in your code as needed to highlight important points.

Sample Code Example

The following is a code example for the c() function. You may use this example verbatim; however, all other examples must be your own.

The function c() combines or concatenates its arguments into a vector (a 1-dimensional data structure consisting of 1 or more elements).

  • All of the elements of a vector have the same type.
    • If I mix character and numeric types in the same call to c(), R converts the numbers to text.

Here are two examples using numeric and character data types:

## Create a vector of numbers:
num_vec  = c(1, 4, 8, 9, 13)

## Create a vector of characters:
char_vec = c("a", "fish", "data is cool")

I can show the contents of a vector by typing the name of the vector, or using the print() function.

## Typing the name of the vector into the console prints the contents
num_vec
## [1]  1  4  8  9 13
## The print() function accomplishes the same task:
print(char_vec)
## [1] "a"            "fish"         "data is cool"
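
One detail worth checking for yourself is what c() does when it receives more than one data type. The short sketch below (an illustration added here, with a made-up vector name) suggests that base R does not stop with an error; it converts every element to a common type.

```r
# Mixing numbers and text in one call to c() does not raise an error
mixed_vec = c(1, "fish", 3)

# Every element is now text, including the former numbers
class(mixed_vec)
mixed_vec
```

If you need to keep numbers as numbers, store them in a separate vector or in their own data.frame column.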

Required Functions and Arguments

All functions are denoted with parentheses.

The required function arguments are contained in indented bullet points below the corresponding function.

Loading Data and Packages

Use these functions to show how to load the here and palmerpenguins packages:

  • library() and require()
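
As a rough illustration of how the two functions differ, the sketch below uses the stats package, which ships with R, so that it runs on any computer. The package name notARealPackage123 is made up to show what happens when a package is missing. In your own guide you would load here and palmerpenguins instead.

```r
# library() attaches a package and stops with an error if it is not installed.
# The stats package ships with R, so this call always works:
library(stats)

# require() also attaches a package, but it returns TRUE or FALSE instead of stopping
stats_loaded = require(stats)
stats_loaded

# A made-up package name shows the FALSE case (the warning is silenced here)
fake_loaded = suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
fake_loaded
```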

Ginkgo data: use the 2021 ginkgo data to create a data.frame called ginkgo using:

  • here()
  • read.csv()
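
The sketch below shows the general read.csv() pattern without relying on the course data. It writes a tiny made-up CSV file to a temporary folder and reads it back. The file name, column names, and values are all hypothetical, and the here() call appears only as a comment because the real file name and folder depend on your own project.

```r
# Write a tiny stand-in CSV to a temporary folder (this is not the ginkgo data)
tmp_file = file.path(tempdir(), "example_leaves.csv")
write.csv(
  data.frame(site_id = c(1, 1, 2), petiole_length = c(40, 55, 61)),
  file = tmp_file,
  row.names = FALSE
)

# read.csv() returns a data.frame built from the file contents
example_leaves = read.csv(tmp_file)
example_leaves

# With the here package, the path is built relative to the project folder.
# The folder and file names below are placeholders, not the real ones:
# ginkgo = read.csv(here::here("data", "my_ginkgo_file.csv"))
```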

Data Structures

  • c()
  • length()
  • matrix()
  • data.frame()

Use the ginkgo data.frame to create examples of:

  • nrow()
  • ncol()
  • dim()
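
A minimal sketch of these functions follows. It uses small made-up objects (num_vec, num_mat, and plants are hypothetical names) rather than the ginkgo data, so it can run anywhere.

```r
# length() counts the elements of a vector
num_vec = c(2, 4, 6)
length(num_vec)

# matrix() arranges values into rows and columns, filling column by column
num_mat = matrix(1:6, nrow = 2, ncol = 3)
num_mat
dim(num_mat)

# data.frame() combines equal-length columns, which may have different types
plants = data.frame(id = 1:4, height_cm = c(10.5, 12.1, 9.8, 11.0))

nrow(plants)  # number of rows
ncol(plants)  # number of columns
dim(plants)   # rows, then columns
```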

Subsetting

Use the ginkgo data for these examples:

  • $ Subset a data frame by name: select one of the columns in the ginkgo data
  • [] Use subset by position to:
    • select first row of the ginkgo data
    • select the element in row 2, column 3
    • select the 3rd column of the ginkgo data
  • subset() Use this function to retrieve all the data for Adelie penguins (in the species column) from the penguins dataset.
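
The subsetting patterns listed above can be sketched with the built-in iris data, used here only as a stand-in for the ginkgo and penguins data. The same syntax should carry over once you swap in the real data.frame and column names.

```r
# $ selects one column by name
sepal_lengths = iris$Sepal.Length

# [row, column] selects by position; leaving one side empty means "all"
first_row    = iris[1, ]    # first row, all columns
one_element  = iris[2, 3]   # row 2, column 3
third_column = iris[, 3]    # all rows, column 3

# subset() keeps the rows for which the condition is TRUE
setosa_only = subset(iris, Species == "setosa")
nrow(setosa_only)
```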

Numerical Data Exploration

You may use the ginkgo or Palmer penguin data to create examples of:

  • summary()
  • mean()
  • sd()
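
Here is a short sketch of these three functions on a made-up vector (masses is a hypothetical name). It includes a missing value because missing values change how mean() and sd() behave, which may matter if the data you choose contains any.

```r
# A small made-up vector with one missing value
masses = c(2, 4, 6, NA)

# summary() reports the quartiles, the mean, and the number of NAs
summary(masses)

# mean() and sd() return NA if any value is missing...
mean(masses)

# ...unless na.rm = TRUE tells them to drop missing values first
mean(masses, na.rm = TRUE)
sd(masses, na.rm = TRUE)
```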

Graphical Data Exploration

Scatterplot: Using the ginkgo data, create a scatterplot of max leaf depth (x) and max leaf width (y).

  • plot() required arguments:

    • col =
    • pch =
    • cex =
    • main =
    • xlab =
    • ylab =
    • xlim =
    • ylim =
  • hist() Create a histogram of penguin flipper lengths. Required arguments:

    • breaks =
  • boxplot()

    • You must include two examples using the ginkgo data:
      1. a simple boxplot of ginkgo petiole lengths
      2. conditional boxplot of one of the continuous variables conditioned on the seeds_present column.
  • Create a 4-panel figure of histograms, arranged in a 2 by 2 grid. You may use any data you like, but each histogram must be different and have appropriate titles and axes.

  • par() required arguments:

    • mfrow =
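
The plotting functions and arguments above can be sketched as follows. This is only an illustration on the built-in iris data, not the ginkgo or penguin data, and the colors, titles, and axis limits are arbitrary choices.

```r
# Scatterplot with the point, title, label, and axis-limit arguments
plot(
  x = iris$Sepal.Length, y = iris$Sepal.Width,
  col = "steelblue",   # point color
  pch = 16,            # point symbol (16 is a filled circle)
  cex = 1.2,           # point size multiplier
  main = "Iris sepal dimensions",
  xlab = "Sepal length (cm)",
  ylab = "Sepal width (cm)",
  xlim = c(4, 8),      # lower and upper limits of the x axis
  ylim = c(1.5, 4.5)   # lower and upper limits of the y axis
)

# Histogram: a single number for breaks suggests how many bins to use
petal_hist = hist(
  iris$Petal.Length, breaks = 12,
  main = "Iris petal lengths", xlab = "Petal length (cm)"
)

# Simple boxplot of one numeric variable
boxplot(iris$Sepal.Length, main = "Sepal length", ylab = "Length (cm)")

# Conditional boxplot: a numeric variable split by a grouping variable
species_box = boxplot(
  Sepal.Length ~ Species, data = iris,
  main = "Sepal length by species", xlab = "Species", ylab = "Length (cm)"
)

# 2 by 2 grid of histograms: mfrow takes c(rows, columns)
old_par = par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Sepal length", xlab = "Length (cm)")
hist(iris$Sepal.Width,  main = "Sepal width",  xlab = "Width (cm)")
hist(iris$Petal.Length, main = "Petal length", xlab = "Length (cm)")
hist(iris$Petal.Width,  main = "Petal width",  xlab = "Width (cm)")

# Restore the previous layout so later plots are not squeezed into the grid
par(old_par)
```

Saving the value returned by par() and restoring it afterwards is one way to keep the grid layout from affecting later figures.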

Distribution Functions

  • dnorm()

  • pnorm()

  • qnorm()

  • dbinom()

  • pbinom()

  • qbinom()
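
As a quick sketch of how the d, p, and q prefixes relate, the example below uses the standard normal distribution and a binomial distribution with 4 trials and a success probability of 0.5. The parameter values are arbitrary choices for illustration.

```r
# Normal distribution (defaults: mean = 0, sd = 1)
dnorm(0)       # density (height of the curve) at x = 0
pnorm(1.96)    # probability of observing a value of 1.96 or less
qnorm(0.975)   # value below which 97.5% of the distribution lies

# Binomial distribution: 4 trials, success probability 0.5
dbinom(2, size = 4, prob = 0.5)    # probability of exactly 2 successes
pbinom(2, size = 4, prob = 0.5)    # probability of 2 or fewer successes
qbinom(0.5, size = 4, prob = 0.5)  # smallest count whose cumulative probability reaches 0.5
```

Note that the q functions undo the p functions: qnorm(pnorm(1.96)) returns 1.96.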

Part 2: Data Analysis

You’ll perform a complete data analysis on the Delomys species data. You can do your work in an RMarkdown document, or an R script (RMarkdown preferred).

Data Exploration

Numerical Exploration

Create a code chunk that includes the following:

  • Use summary() on the body mass and body length data columns in the Delomys data set to display summary statistics.

  • Perform a test of normality on the body mass and body length columns. You can use shapiro.test().
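
The sketch below shows the shape of these calls on simulated data. The data.frame fake_delomys is made up: its column names follow the body_mass and body_length names used in the model formulas later in this document, but the values are random numbers, not real measurements.

```r
# Simulated stand-in data (random values, not the Delomys measurements)
set.seed(1)
fake_delomys = data.frame(
  body_mass   = rnorm(50, mean = 45, sd = 8),
  body_length = rnorm(50, mean = 125, sd = 10)
)

# Summary statistics for each column
summary(fake_delomys$body_mass)
summary(fake_delomys$body_length)

# shapiro.test() takes a numeric vector.
# Its null hypothesis is that the values come from a normal distribution.
mass_test = shapiro.test(fake_delomys$body_mass)
mass_test$p.value
```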

Graphical Exploration

You can adjust the size of the plots on your rendered document using the following code chunk arguments:

fig.height= fig.width=

You can adjust the aspect ratio (height divided by width) using fig.asp=

Here’s an example code chunk that uses the fig.width option with the penguins data.

```{r fig.width=10}
require(palmerpenguins)
plot(bill_length_mm ~ body_mass_g, data = penguins)
```

Producing the following output:

[Figure: scatterplot of bill_length_mm (y) against body_mass_g (x) for the penguins data]

  • Try different values for the chunk options to get a feel for how they affect the plots.

You will need to experiment with different width, height, and/or aspect values for each of your figures.

Using code chunks, create the following plots, which you’ll use to answer the report questions:

  • A scatterplot of body mass and body length
  • A histogram of body mass
  • A histogram of body length
  • A conditional boxplot of body mass, conditioned on species (column binomial)
  • A conditional boxplot of body mass, conditioned on sex (column sex)
  • A conditional boxplot of body mass, conditioned on both species and sex
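
The formula syntax for these plots can be sketched on simulated data. Everything in fake_delomys is made up, including the placeholder species labels. Only the column names follow those mentioned in this document.

```r
# Simulated stand-in data with two grouping columns (not the real data)
set.seed(2)
fake_delomys = data.frame(
  body_mass = rnorm(40, mean = 45, sd = 8),
  sex       = rep(c("female", "male"), times = 20),
  binomial  = rep(c("Species A", "Species B"), each = 20)
)
fake_delomys$body_length = 80 + fake_delomys$body_mass + rnorm(40, sd = 5)

# Scatterplot using a formula: y ~ x
plot(body_length ~ body_mass, data = fake_delomys,
     main = "Simulated length and mass", xlab = "Body mass", ylab = "Body length")

# Histogram of one column
hist(fake_delomys$body_mass, main = "Simulated body mass", xlab = "Body mass")

# Boxplot conditioned on one grouping variable
boxplot(body_mass ~ binomial, data = fake_delomys,
        main = "Mass by species", ylab = "Body mass")

# Boxplot conditioned on two grouping variables: one box per combination
both_box = boxplot(body_mass ~ sex * binomial, data = fake_delomys,
                   main = "Mass by sex and species", ylab = "Body mass")
both_box$names
```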

Q1-4: Data Exploration

Answer the following in your report:

  • Q1 (2 pts.): Qualitatively describe the relationship between body mass and length.
    • Does the relationship seem linear, curved, nonexistent?
  • Q2 (2 pts.): Qualitatively describe the shapes of the histograms.
    • Do the data appear normally-distributed? Explain why or why not.
    • Explain why we care (or not) whether the data are normally distributed.
  • Q3 (2 pts.): Using both the histograms and normality tests, do you think the (unconditioned) body masses and body lengths are normally-distributed?
    • Make sure you contrast your visual assessment of normality to the results of the numerical normality tests.
  • Q4 (2 pts.): Examine the three conditional boxplots.
    • Describe any graphical evidence you see for body mass differences based on species and/or sex.

Model Building

We know that the normality assumption applies to the residual values after we fit a model.

Using a code chunk, fit 5 models using lm():

  • Model 1: simple linear regression body_length ~ body_mass
  • Model 2: 1-way ANOVA body_mass ~ sex
  • Model 3: 1-way ANOVA body_mass ~ binomial
  • Model 4: 2-way additive ANOVA body_mass ~ sex + binomial
  • Model 5: 2-way factorial ANOVA body_mass ~ sex * binomial
  • The first model predicts body length as a function of body mass
  • The other models use the categorical variables binomial and sex to predict body mass.

Save your model objects to variables called fit1, fit2, fit3, fit4, fit5.
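
To show how the five formulas translate into lm() calls, here is a sketch fitted to simulated data. The fake_delomys values are random and the species labels are placeholders, so the fitted numbers mean nothing. Only the syntax carries over to the real data.

```r
# Simulated stand-in data (not the real Delomys measurements)
set.seed(3)
fake_delomys = data.frame(
  body_mass = rnorm(40, mean = 45, sd = 8),
  sex       = rep(c("female", "male"), times = 20),
  binomial  = rep(c("Species A", "Species B"), each = 20)
)
fake_delomys$body_length = 80 + fake_delomys$body_mass + rnorm(40, sd = 5)

# response ~ predictor(s): + adds main effects, * adds main effects and their interaction
fit1 = lm(body_length ~ body_mass, data = fake_delomys)
fit2 = lm(body_mass ~ sex, data = fake_delomys)
fit3 = lm(body_mass ~ binomial, data = fake_delomys)
fit4 = lm(body_mass ~ sex + binomial, data = fake_delomys)
fit5 = lm(body_mass ~ sex * binomial, data = fake_delomys)

# The factorial model has one more coefficient than the additive model
length(coef(fit4))
length(coef(fit5))
```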

Model Diagnostics

Let’s check whether our models fulfill the assumption of normality of the residuals.

First, use a graphical approach: plot histograms of the model residuals.

  • You can retrieve the model residuals using the residuals() function. For example, I could get the residuals from the first model using residuals(fit1).

Use a code chunk to create histograms of the residuals of each of the 5 models.

Next, use shapiro.test() on the residuals of each model to test the null hypothesis that the residuals are drawn from a normally-distributed population.
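
A minimal sketch of both diagnostics follows, using a model fitted to the built-in ToothGrowth data as a stand-in for your own model objects. Note that shapiro.test() expects a numeric vector, so the residuals are extracted first.

```r
# Stand-in model on built-in data (not one of the Delomys models)
example_fit = lm(len ~ supp, data = ToothGrowth)

# Extract the residuals: one value per observation
example_resid = residuals(example_fit)

# Graphical check: histogram of the residuals
hist(example_resid, main = "Residuals of example_fit", xlab = "Residual")

# Numerical check: Shapiro-Wilk test on the residual vector
resid_test = shapiro.test(example_resid)
resid_test$p.value
```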

Q5-6: Model Assumptions

Answer the following in your report:

  • Q5 (2 pts.): What do you conclude about residual normality based on the numerical and graphical diagnostics?
  • Q6 (1 pt.): Are violations of the normality assumption equally severe for all the models?

Model Interpretation

You can use the following code within a code chunk to print out a nicely formatted model coefficient table:

knitr::kable(coef(summary(my_model_fit)))

where my_model_fit is the name of your fitted model object.

You can use similar syntax to print a nicely formatted ANOVA table: knitr::kable(anova(my_model_fit))

  • Check out the digits argument to control how many decimal digits are printed.
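
Here is a sketch of both tables, again on a stand-in model fitted to the built-in ToothGrowth data. The knitr calls are wrapped in a check so the example still runs if knitr is not installed. The choice of digits = 3 is arbitrary.

```r
# Stand-in model on built-in data
example_fit = lm(len ~ supp, data = ToothGrowth)

# The coefficient table is a matrix; the ANOVA table is a data.frame
coef_table  = coef(summary(example_fit))
anova_table = anova(example_fit)

# kable() formats each table; digits sets the number of decimal places
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(coef_table, digits = 3))
  print(knitr::kable(anova_table, digits = 3))
}
```

Inside an R Markdown code chunk you can call knitr::kable() directly without print().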

Q7-9: Simple Linear Regression

Print the model coefficient table using summary() and answer the following:

  • Q7 (2 pts.): What is the magnitude of the mass/length relationship?
  • Q8 (2 pts.): What is the expected body length of an animal that weighs 100g?
  • Q9 (2 pts.): What is the expected body length of an animal that weighs 0g?
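
For questions about expected values, it may help to see how a fitted line turns a predictor value into a prediction. The sketch below uses the built-in cars data (stopping distance as a function of speed) purely as a stand-in, and the predictor value of 10 is arbitrary.

```r
# Stand-in regression on built-in data
example_fit = lm(dist ~ speed, data = cars)
b = coef(example_fit)

# By hand: expected value = intercept + slope * predictor value
by_hand = b["(Intercept)"] + b["speed"] * 10

# predict() does the same arithmetic for any new predictor values
by_predict = predict(example_fit, newdata = data.frame(speed = 10))

by_hand
by_predict
```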

Q10-13: Body Mass: Coefficient Tables

Print the model coefficient tables for each of the body mass model fits.

Answer the following:

  • Q10 (1 pt.): What is the base level for sex?
  • Q11 (1 pt.): What is the base level for binomial?
  • Q12 (1 pt.): Which sex is heavier? How do you know?
  • Q13 (1 pt.): Which species is heavier? How do you know?
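
If the idea of a base level is unfamiliar, the following sketch on the built-in ToothGrowth data may help. By default, lm() treats the first level of a factor as the base level, and the coefficient table names the other levels. The variable supp_vc_first is a made-up name.

```r
# The first level listed is the base (reference) level
levels(ToothGrowth$supp)

# The coefficient is named for the non-base level.
# It estimates that level's difference from the base level.
example_fit = lm(len ~ supp, data = ToothGrowth)
coef(example_fit)

# relevel() makes a copy of the factor with a different base level
supp_vc_first = relevel(ToothGrowth$supp, ref = "VC")
levels(supp_vc_first)
```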

Q14-16: ANOVA Tables

Print the ANOVA tables for each of the body mass models.

Answer the following in your report:

  • Q14 (1 pt.): Are sex and species significant predictors for body mass?
  • Q15 (1 pt.): Is there a significant interaction?
  • Q16 (2 pts.): Examine the p-values for the main effects (sex and species) in all four of the ANOVA tables. Does the significance level of either main effect change very much among the different models?

Model Comparison: Body Mass

You built four different models of body mass. How do you choose the best one?

One option is to choose the model with the lowest AIC. You can calculate AIC using the appropriately named AIC() function.

Create a code chunk that calculates the AIC values for each of the body mass models.
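
As a sketch of the AIC() call, the example below compares an additive and a factorial model fitted to the built-in ToothGrowth data. These are stand-ins, not the body mass models. Passing several models at once returns a small table.

```r
# Stand-in data: treat dose as a categorical variable
tooth = ToothGrowth
tooth$dose = factor(tooth$dose)

fit_additive  = lm(len ~ supp + dose, data = tooth)
fit_factorial = lm(len ~ supp * dose, data = tooth)

# One model: a single number
AIC(fit_additive)

# Several models: a data.frame with one row per model
aic_table = AIC(fit_additive, fit_factorial)
aic_table
```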

Q17-18: Model Comparison

  • Q17 (1 pt.): Which two models have the lowest AIC?
  • Q18 (4 pts.): Which of the two models with lowest AIC scores would you select?
    • Explain your decision based on model fit and the complexity/understanding tradeoff.

Data Analysis Report

Compile your answers to the 18 questions and submit them as a PDF or HTML document in Moodle.

Your final draft report should include only your figures and answers to the questions. Do not include any extraneous R code that you may have used in your rough draft. I do not need to see the code you used to read the data.

If you are using RMarkdown, you may add the code chunk options echo=FALSE and results='hide' to suppress the printing of any R code or output you wish to hide.