Learning objectives and technical skills

  • Working with working directories
  • Importing data from the internet and local files
  • Data manipulation: merging data frames
  • Graphical and numerical data exploration
  • Installing R packages

Introduction

  • This assignment will build upon previous in-class activities.

Overview

In this assignment you’ll:

  1. Install the here package.
  2. Download and read data files into R.
  3. Perform an exploratory data analysis.

Installing R packages

R is very extensible. That is one of its greatest strengths! There are thousands of R packages that contain functions for performing analyses and creating graphics beyond what is included in base R.

I’ll walk through the process of installing and loading an R package in the following sections.

I’m going to use the here package as an example.

The here package is designed to make file import/export easier. It works in conjunction with an RProject.

The install.packages() function

Most R packages can be installed with the install.packages() function.

The syntax for basic usage is simple: just type in the name of the package you want to install (in quotes). By default install.packages() searches the CRAN repositories for a matching package.

To install here you can just type:

install.packages("here")

Advanced package installation

  • There are a lot of options for installing packages. You should check out the help entry for install.packages() to learn about the arguments.

  • You can also install packages from other repositories, including Bioconductor and GitHub.

  • Some packages come as pre-compiled binaries while some others must be compiled from source code. If you aren’t sure what these terms mean, don’t worry. R will let you know if you have to install a package from source, and you’ll be prompted to install the necessary packages and other tools.

Installing a package

The syntax to install here is simple:

install.packages("here")

Depending on which operating system you are using, which R and RStudio versions you have, and the packages you already have installed, you may get a message about installing dependencies. You should click ‘ok’ to install any of the additional packages that here might need.

You may also get a popup asking you to choose a repository. You can select the cloud option.

If R is able to install the package successfully, you’ll see a message in the console that looks something like this:

package ‘here’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in C:_packages
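
If you run the same script many times, you may not want to reinstall a package on every run. Here is a minimal sketch of one way to install a package only when it is missing. The helper name install_if_missing() is made up for this illustration; it is not part of R or of the here package. The functions it calls, requireNamespace() and install.packages(), are standard R functions.

```r
# Hypothetical helper (not part of base R): install a package only if
# it is not already installed, then report whether it is available.
install_if_missing = function(pkg)
{
  # requireNamespace() returns TRUE if the package can be loaded.
  if (!requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg)

  return(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded in this demonstration.
stats_available = install_if_missing("stats")
print(stats_available)
```

For this assignment you could call install_if_missing("here") in the same way. The first call would trigger the installation and later calls would skip it.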

Loading a package

When R first starts, it loads functions and data from the base packages. These objects are always available.

R does not load the extra packages you may have installed by default, so you need to tell it you want to use them!

There are two functions to accomplish this:

  1. library()
  2. require()

Both of these functions will load a package into memory, making it directly available to you.

The difference is in what happens when a package can’t be found: library() stops with an error, while require() gives a warning and returns FALSE (it returns TRUE when the package loads successfully). If the package is already loaded, neither function re-loads it.

This difference isn’t usually important and it’s up to you to choose which method you want to use.

Because an already-loaded package is not loaded a second time, running a script file many times does not repeat the loading cost with either function.

On the other hand, if you have updated a package while you are using R and you need to load the updated version, calling library() or require() again is not enough. Restart your R session and then load the package.
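
One practical difference you can check for yourself is what each function does when a package is missing. The sketch below uses the stats package (which ships with R) and a made-up package name, notARealPackage12345, which I am assuming is not installed on your computer.

```r
# "stats" ships with R, so both calls succeed.
library(stats)
stats_loaded = require(stats)   # require() returns TRUE on success
print(stats_loaded)

# A made-up package name, assumed not to be installed.
fake_pkg = "notARealPackage12345"

# require() warns and returns FALSE when the package is missing.
require_result = suppressWarnings(require(fake_pkg, character.only = TRUE))
print(require_result)

# library() stops with an error instead, which we catch here.
library_result = tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(library_result)
```

The character.only = TRUE argument is only needed because the package name is stored in a variable.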

Importing data from files can be a pain. Believe it or not, for any project a lot of time is spent on data import and cleaning/screening.

Any tools we can use to speed up the data import process are very helpful!

Working directories: the here package to the rescue!

Some background:

  • R utilizes the concept of a working directory to locate file resources.
  • The working directory is simply the directory where R will look for files.
  • You can change the working directory at any time using the function setwd(). You should almost never use the setwd() function.

However, changing your working directory means that R will look in the new location for files. If you had previously been working with a file called my_data.csv that was in your working directory, but you then changed working directories, R will no longer be able to find my_data.csv by name alone.
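
You can reproduce this problem safely with a small sketch that works entirely inside R’s temporary directory. The folder names project_a and project_b are invented for the demonstration, and the original working directory is restored at the end.

```r
# Remember where we started so we can go back.
old_wd = getwd()

# Two made-up folders inside R's temporary directory.
dir_a = file.path(tempdir(), "project_a")
dir_b = file.path(tempdir(), "project_b")
dir.create(dir_a, showWarnings = FALSE)
dir.create(dir_b, showWarnings = FALSE)

# Put a small data file in the first folder only.
write.csv(data.frame(x = 1:3), file.path(dir_a, "my_data.csv"), row.names = FALSE)

# With the working directory set to project_a, the file is found by name.
setwd(dir_a)
found_in_a = file.exists("my_data.csv")

# After changing the working directory, the same name no longer works.
setwd(dir_b)
found_in_b = file.exists("my_data.csv")

# Restore the original working directory.
setwd(old_wd)

print(c(found_in_a = found_in_a, found_in_b = found_in_b))
```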

🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨

Relying on setwd() in your R code is an indicator you may be veering into code smell territory.

🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨 🦨

Understanding the working directory and locating data files are two of the biggest sources of frustration for new R users.

We’ll try to save some time and headaches by using RProjects and the package here.

Whenever you find yourself using the setwd() function, it should raise red flags.

You should seriously consider using here() instead.

About the here package

The package here is designed to work with RProjects.

One of the most frustrating aspects for new users of R is understanding the concept of a working directory and importing data files.

Here simplifies these tasks when used within an RProject.

Using here()

For this example, I’m going to assume:

  • You are working with RStudio and that you have an RProject loaded.
  • You have a subdirectory of your main RProject directory called data.
  • There is a file called my_data.csv containing data in the comma separated value format within your data subfolder.

To follow along with the example, you can download the data file from the Data Files section of the ECo 602 page.

The function here() returns the absolute path to the base directory of your RProject.

For example, on my computer when I’m working in the RProject for the ECo 602 course, here produces:

here()
## [1] "C:/git/environmental_data"

Note that here() will always point to this directory, even if my working directory is set to a different location. For example, if I had set my working directory to the assignments subdirectory, getwd() and here() would give different answers:

getwd()
here()
## [1] "C:/git/environmental_data/assignments"
## [1] "C:/git/environmental_data"

Opening a file with here()

Here’s why here() is so useful.

Recall that my file is located in the data subdirectory of my RProject folder.

If my working directory were set to the main RProject directory I could just type:

read.csv("data/my_data.csv")

But we know my working directory is pointed to a different folder so I get the following:

read.csv("data/my_data.csv")
## Error in file(file, "rt") : cannot open the connection

Here is here() to the rescue:

read.csv(here("data", "my_data.csv"))
##   basin sub sta
## 1     D  AL   1
## 2     D  AL   2
## 3     D  AL   3

Basic here() syntax.

You’ll notice I typed:

read.csv(here("data", "my_data.csv"))

When you call here() you should include the subdirectories (in the correct order) and filename as character values (i.e. with quotation marks). The function will assemble the arguments into an absolute path to the file:

here("data", "my_data.csv")
## [1] "C:/git/environmental_data/data/my_data.csv"

NOTE: if your file is located several subdirectories in, you have to list the directory names in the order in which they are nested.

For example, if my data file were located within a subdirectory of data called data_sets, I would type:

here("data", "data_sets", "my_data.csv")
## [1] "C:/git/environmental_data/data/data_sets/my_data.csv"
  • You can consult the help entry for here() for a more detailed description.
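
To see how the pieces are joined, here is a rough base R analogy using file.path(). The project root is simply copied from my example output, so this only sketches the idea of assembling a path from its parts. It does not show how here() locates your project.

```r
# Project root copied from the example output above (an assumption:
# your own project will live somewhere else).
project_root = "C:/git/environmental_data"

# file.path() joins its arguments with "/", in the order given.
nested_path = file.path(project_root, "data", "data_sets", "my_data.csv")
print(nested_path)
## [1] "C:/git/environmental_data/data/data_sets/my_data.csv"
```

If you list the directories in the wrong order, you get a different path that probably doesn’t exist.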

A reality check: file.exists()

here() is not foolproof. If you don’t tell it the correct subdirectory or filename to search for, it won’t find your file!

An easy way to tell whether you are looking in the right spot for your file is the function file.exists():

file.exists(here("data", "data_sets", "my_data.csv"))
## [1] FALSE

Oops, I forgot that my data file is one directory back in the data folder:

file.exists(here("data", "my_data.csv"))
## [1] TRUE

And I’m good to go!
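
If you like, you can build the check into your import step. The sketch below is one possible way to do it. The helper name read_csv_checked() is invented for this example, and a temporary stand-in file is used so that the code runs on any computer.

```r
# Hypothetical helper: read a CSV file only if it exists, otherwise
# stop with a message that shows the path that was tried.
read_csv_checked = function(path)
{
  if (!file.exists(path))
    stop("File not found: ", path)

  return(read.csv(path))
}

# Stand-in file in R's temporary directory.
demo_path = file.path(tempdir(), "my_data.csv")
write.csv(
  data.frame(basin = "D", sub = "AL", sta = 1:3),
  demo_path,
  row.names = FALSE
)

dat_demo = read_csv_checked(demo_path)
print(dat_demo)

# A wrong path gives a message that names the path that failed.
wrong_path_message = tryCatch(
  read_csv_checked(file.path(tempdir(), "data_sets", "my_data.csv")),
  error = function(e) conditionMessage(e)
)
print(wrong_path_message)
```

Within an RProject you would pass a path built with here(), for example here("data", "my_data.csv"), instead of the temporary path.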

Assignment Data

Now that you know all about using here(), you’re ready to work with the assignment data.

Download the data files.

You will be working with the bird census habitat data for this assignment. Download the data file and save it to the data subdirectory of your main ECo 602 repository directory. You can find the file ‘hab.sta.csv’ in Assignment Data Files in the Course Materials section of the class GitHub page.

The metadata is in the file ‘birds_metadata.pdf’, which describes the data and decodes the names of the columns.

Load the data into R

You have saved the data file to a subdirectory called data and you now know how to use the here() function to make finding it easy.

Use here() and read.csv() to read hab.sta.csv into a data.frame called dat_habitat.

Sample site characteristics

Let’s focus on the terrain variables at the sampling locations:

  • elevation
  • slope
  • aspect

and the tree cover, as measured by basal area.

Histograms

Examine histograms of the three terrain variables.
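
If you haven’t made a histogram in R before, here is a minimal sketch of the hist() function. It uses the built-in iris data as a stand-in, not the habitat data, so you will need to substitute your own data frame and column names.

```r
# Stand-in data: sepal lengths from R's built-in iris data set.
hist_sepal = hist(
  iris$Sepal.Length,
  main = "Histogram of iris sepal length",
  xlab = "Sepal length (cm)"
)

# hist() also returns the bin edges and counts, which give you a
# numerical summary to go with the picture.
print(hist_sepal$breaks)
print(hist_sepal$counts)
```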

This is how my basic histogram of slope looks:

Scatterplots

Next, create scatterplots of the three terrain variables (on the x-axis) and basal area (on the y-axis).

Hint: use the plot() function to make scatterplots.

  • Check out the main, xlab, and ylab arguments to plot() to customize your scatterplots.

Here’s my plot of basal area and slope:

Fitting linear functions

Recall the visual estimation of linear models from the in-class activity. Try to estimate linear function parameters using your scatterplots, then add the lines to your scatterplots to judge the fit visually.

Here are the linear parameterization functions again:

# Calculates the value of y for a linear function, given the coordinates
# of a known point (x1, y1) and the slope of the line.
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept = 
    function(x1, y1, slope) 
      return(-(x1 * slope) + y1)
  
  linear = 
    function(x, yint, slope) 
      return(yint + x * slope)
  
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}
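
To make the arithmetic inside the function concrete, here is the same calculation done by hand in two steps. The point (3.5, 1.25), the slope of 0.4, and the new x value of 5 are just example numbers chosen for illustration.

```r
# A known point on the line and the slope (example values).
x1 = 3.5
y1 = 1.25
slope = 0.4

# Step 1: the y-intercept, as computed by get_y_intercept().
y_intercept = -(x1 * slope) + y1
print(y_intercept)

# Step 2: the value of the line at a new x, as computed by linear().
x_new = 5
y_new = y_intercept + x_new * slope
print(y_new)

# Sanity check: at x = x1 the line passes through the known point.
y_at_x1 = y_intercept + x1 * slope
print(all.equal(y_at_x1, y1))
```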

Recall how we used them on the Iris data? You could probably fit a better line than the one I show below.

plot(
  x = iris$Petal.Length, 
  y = iris$Petal.Width,
  xlab = "Petal Length",
  ylab = "Petal Width",
  main = "Visually-estimated linear model fit\nIris petal length and width"
)
curve(line_point_slope(x, x1 = 3.5, y1 = 1.25, slope = 0.4), add = TRUE)

You should review the week 2 in-class activity instructions if you need a refresher.

Instructions

  1. Plot histograms of the following terrain variables:
  • elevation
  • aspect
  • slope

You’ll need these for the assignment questions.

  2. Create scatterplots of total basal area and the terrain variables (consult the metadata file to see which column(s) you need). Basal area should be on the y-axis.
  • Visually inspect the plots and fit a linear function to each of the scatterplots using the parameterization functions provided above.
  • You’ll need this fitted model for the assignment questions.
  3. Answer the assignment questions and submit your final answers in a report (preferably pdf or html) via Moodle.

Assignment Questions

Q1 Terrain Histograms

Instructions:

  1. Create histograms for the three terrain variables: elevation, slope, and aspect.
  2. Plot all three histograms in one figure and include it in your report.
  • You might want to skip ahead and read the terrain/basal area scatterplots question below for an idea of how to organize your plots.

  • Hint: you can use par(mfrow = c(3, 1)) to create a figure with three panels arranged in a single column.

  • Hint: par(mfrow = c(1, 3)) will create a figure with three panels arranged in a single row.

  • Hint: Choose dimensions for your output file so that the individual histograms have an appropriate aspect ratio.

  • Hint: You may notice some peculiarities with the rightmost bin in the aspect histogram. Consider the units in which aspect is measured and check out the breaks argument for hist().
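
The layout and output-size hints above can be sketched as follows. This example uses the built-in iris data as a stand-in for the terrain variables and writes to a temporary pdf file. The file name and the width and height values (in inches) are my own guesses, so adjust them for your plots.

```r
# Write the figure to a temporary pdf file. The width and height are
# in inches; these values are only a starting point.
fig_path = file.path(tempdir(), "three_histograms.pdf")
pdf(fig_path, width = 9, height = 3.5)

# One row, three columns of panels.
par(mfrow = c(1, 3))
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
hist(iris$Sepal.Width,  main = "Sepal width",  xlab = "cm")
hist(iris$Petal.Length, main = "Petal length", xlab = "cm")

# Close the file device so the figure is actually written to disk.
dev.off()

print(file.exists(fig_path))
```

Swapping to par(mfrow = c(3, 1)) stacks the panels in a column, in which case a tall, narrow file (for example width = 4, height = 9) is likely to look better.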

Q2 Elevation Histogram Interpretation

Consider the distribution of elevations at the bird census sample sites.

  • Interpret the shape of the elevation histogram in non-technical language that a non-scientist audience would understand. Some points to consider:
  • Are there more high- or low-elevation sampling sites?
  • Is there an even distribution of sampling site elevation?

Your answer should be 1-2 short paragraphs in length.

Q3 Slope Units

What are the units of slope in this data set?

Hint: Properly curated data often has associated ____data…

Q4 Slope Histogram Interpretation

Consider the distribution of slopes at the bird census sample sites.

  • Interpret the shape of the slope histogram in non-technical language that a non-scientist audience would understand. Some points to consider:
  • Are most sample sites flat?
  • Is there an even mixture of steep and shallow slopes?

Your answer should be 1-2 short paragraphs in length.

Q5 Aspect

  • Briefly define aspect, describing the units used in this dataset.

Q6 Aspect Histogram Interpretation

Consider the distribution of aspect at the bird census sample sites.

  • Interpret the shape of the aspect histogram in non-technical language that a non-scientist audience would understand. Some points to consider:
  • Do the sampling sites tend to be on north-facing slopes?
  • South-facing?
  • Evenly distributed?

Your answer should be 1-2 short paragraphs in length.

Q7 Terrain/Basal Area Linear Models

Instructions:

  1. Create scatterplots of total basal area and each of the terrain variables: elevation, slope, and aspect.
  • Basal area should be on the y-axis.
  2. Visually inspect the plots and fit a linear function to each terrain variable.
  • Review the linear model parameterization section of the assignment walkthrough if needed.

A plot with three panels in a single row, or in a single column can be awkwardly long or tall. What if you combined your histograms and scatterplots into a larger figure with 6 panels?

  • Hint: you can use par(mfrow = c(3, 1)) to create a figure with three panels arranged in a single column.
  • Hint: par(mfrow = c(1, 3)) will create a figure with three panels arranged in a single row.
  • Hint: Choose dimensions for your output file so that the individual scatterplots have an appropriate aspect ratio.
  • Hint: There are many points, some of which partially overlap, in the scatterplots. You may want to experiment with the size of individual points to see if you can find an optimal size that allows you to see all of them. Try different values of the cex argument in your plot() call.
  • Hint: Choose a different color for your lines. Experiment with the col argument to curve().
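
Here is a minimal sketch of a six-panel figure that combines several of the hints above. It again uses the built-in iris data as a stand-in for the habitat data. The line drawn in each scatterplot (y = 0.2 * x) is arbitrary and is only there to show how to add and color a line; it is not a fitted model.

```r
# Two rows and three columns of panels; par() returns the old
# settings so they can be restored afterwards.
old_par = par(mfrow = c(2, 3))
n_panels = prod(par("mfrow"))

# Stand-ins for the three terrain variables.
stand_in_vars = c("Sepal.Length", "Sepal.Width", "Petal.Length")

# Top row: one histogram per variable.
for (v in stand_in_vars)
  hist(iris[[v]], main = v, xlab = v)

# Bottom row: scatterplots with smaller points (cex) and a colored,
# arbitrary line added with curve().
for (v in stand_in_vars)
{
  plot(
    x = iris[[v]],
    y = iris$Petal.Width,
    xlab = v,
    ylab = "Petal.Width",
    cex = 0.6
  )
  curve(0.2 * x, add = TRUE, col = "steelblue")
}

# Restore the previous plotting layout.
par(old_par)
```

Panels are filled row by row, so the histograms land in the top row and each scatterplot sits directly below the histogram of the same variable.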

Q8 Terrain/Basal Model Interpretation

For each terrain variable (elevation, slope, aspect), describe the relationship you observe and your model fit. You should consider

  • Is there a noticeable association?
  • If so, is it linear?
  • Based on a visual assessment, is your linear model a good fit for the data? Why or why not?

Report

Compile your answers to all 8 questions into a pdf document and submit via Moodle.

  • You may also do your work in an R Notebook and submit a rendered html file.