Objectives

  • Import ginkgo data
  • Perform a basic graphical exploration

Instructions

  • Retrieve the ginkgo data 2022 file and read it into a data.frame object.
  • Inspect the dataset.
    • What columns does it contain?
    • What are the data types of each column?
  • Should the site_id column be numeric or Boolean?
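
The inspection steps above can be sketched on a small stand-in data.frame. The toy object and its values are invented for illustration; with the real file you would run the same functions on the data.frame you read in.

```r
# Toy stand-in for the ginkgo data; these values are made up for illustration.
toy = data.frame(
  site_id = c(1, 1, 2, 2, 3),
  seeds_present = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  max_width = c(61.2, 58.9, 70.4, 66.1, 55.0)
)

names(toy)          # the column names
sapply(toy, class)  # the class (data type) of each column
str(toy)            # compact summary: dimensions, types, and first few values
```

str() is often the quickest first look because it reports the number of rows, the column names, and the column types in one call.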

Questions

  • Q1 (1 pt.): How many trees are in the dataset?
    • Hint: Try using subset() with the select argument to create a data.frame that contains only the seeds_present and site_id columns. Next, check out the unique() function.
  • Q2 (1 pt.): How many trees had seeds?
  • Q3 (1 pt.): Include a conditional boxplot of one of the continuous variables conditioned on the seeds_present column in your report.
  • Q4 (1 pt.): Based on your boxplot, do you think there is any difference between seed-bearing and non-seed-bearing trees? Note: this is just a preliminary data exploration, so you may change your mind based on further analysis!
  • Q5 (1 pt.): Create a scatterplot of max leaf depth (x) and max leaf width (y).
  • Q6 (1 pt.): Qualitatively describe the patterns you see in the scatterplot.
  • Q7 (1 pt.): Explain how our data collection procedure might have violated the fixed x assumption.
  • Q8 (1 pt.): Name 1 or more concepts you’d like me to review or discuss in more detail.
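
As a rough illustration of the Q1 hint, here is how subset() and unique() behave on a toy data.frame. The toy values are invented and are not the ginkgo data, so the result says nothing about the real answer.

```r
# Toy stand-in data; values are made up for illustration only.
toy = data.frame(
  site_id = c(1, 1, 2, 2, 3),
  seeds_present = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  max_width = c(61.2, 58.9, 70.4, 66.1, 55.0)
)

# Keep only the two columns named in the select argument.
toy_sub = subset(toy, select = c(seeds_present, site_id))

# unique() on a data.frame drops duplicated rows,
# leaving one row per distinct combination of values.
toy_unique = unique(toy_sub)
nrow(toy_unique)
```

The same pattern should carry over to the real data, but check the output of each step before trusting the final count.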

Optional Supplement: ggplot2

So far, we’ve only used plotting functions from base R. There are several other plotting paradigms developed for R including grid, lattice, and ggplot.

The “gg” in ggplot stands for the grammar of graphics. The grammar is a systematic way of thinking about visualizing data. The grammar is implemented in R in the ggplot2 package.

We won’t have time to go into the art and science of graphing with ggplot2. Instead, I’ll provide a couple of examples that you can tinker with.

The syntax for graphics with ggplot2 is fundamentally different from the syntax for base R plots.

One thing you’ll notice in the following examples is the idiosyncratic use of the plus symbol. In ggplot2, the addition operator has been overloaded so that it can be used to combine graphical layers together.
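
To see what combining layers with the plus symbol looks like, here is a minimal sketch that builds a plot one piece at a time. It assumes ggplot2 is already installed, and the toy data.frame is invented for illustration.

```r
require(ggplot2)

# Toy data; values are made up for illustration only.
toy = data.frame(
  max_width = c(55, 61, 70),
  notch_depth = c(8, 12, 15)
)

# Start with the data and the axis mappings; nothing is drawn yet.
p = ggplot(toy, aes(x = max_width, y = notch_depth))

# Each + adds another piece to the same plot object.
p = p + geom_point()                 # a layer of points
p = p + xlab("Max Leaf Width (mm)")  # an x-axis label

print(p)
```

Storing the plot in an object like this is optional. Chaining everything in one expression, as in the examples that follow, gives the same result.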

Install the package

You’ll probably need to install the ggplot2 package. Remember that you can use the install.packages() function to install a new package.
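
One common pattern, sketched here, is to install only when the package is missing. It assumes you have an internet connection and a CRAN mirror configured (RStudio usually sets one for you).

```r
# Install ggplot2 only if it is not already available.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
```

You only need to install a package once per R installation, but you need to load it with require() or library() in every new session.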

A simple scatterplot

require(ggplot2)
require(here)  # provides here() for building file paths relative to the project
dat = read.csv(here("data", "ginkgo_data_2022.csv"))
names(dat)
## [1] "site_id"        "seeds_present"  "max_width"     
## [4] "max_depth"      "notch_depth"    "petiole_length"
ggplot(dat, aes(x = max_width, y = notch_depth)) +
  geom_point() +
  xlab("Max Leaf Width (mm)") +
  ylab("Notch Depth (mm)")

Notice the call to aes(). This function sets the aesthetics of the plot. It is the part of the code that maps variables in the data.frame to the x- and y-axes (for scatterplots).

Try editing the code above to make a scatterplot of max_width on the x-axis and max_depth on the y-axis.

Hint:
ggplot(dat, aes(x = max_width, y = max_depth)) +
  geom_point() +
  xlab("Max Leaf Width (mm)") +
  ylab("Max Leaf Depth (mm)")

Adding colors

One of the greatest strengths of plotting with ggplot is that it is easy to group observations by a factor variable in order to display them using different colors or symbols.

Grouping by seeds present

To add a variable by which to color the points, we can use the colour argument to aes(). Let’s try the seeds_present column:

ggplot(dat, aes(x = max_width, y = notch_depth, colour = seeds_present)) +
  geom_point() +
  xlab("Max Leaf Width (mm)") +
  ylab("Notch Depth (mm)")

Try to recreate the following plot:

Grouping by site ID

I can also make a scatterplot that color-codes the points by individual tree ID (the site_id column).

  • Note that I used colour = factor(site_id) so that site_id was treated as a factor rather than a numeric column.
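
A minimal sketch of what such a call might look like is shown below. It uses an invented toy data.frame rather than the ginkgo data, and the legend title set with labs() is my own choice, so treat it as a starting point rather than the exact code behind the plot.

```r
require(ggplot2)

# Toy data; values are made up for illustration only.
toy = data.frame(
  site_id = c(1, 1, 2, 2, 3, 3),
  max_width = c(55, 58, 61, 64, 70, 72),
  notch_depth = c(8, 9, 12, 11, 15, 16)
)

# factor(site_id) makes ggplot2 treat the IDs as categories,
# so each ID gets its own distinct color instead of a continuous gradient.
p_site = ggplot(toy, aes(x = max_width, y = notch_depth, colour = factor(site_id))) +
  geom_point() +
  xlab("Max Leaf Width (mm)") +
  ylab("Notch Depth (mm)") +
  labs(colour = "Site ID")  # legend title (an illustrative choice)

print(p_site)
```

Try removing factor() to see how the legend changes when site_id is treated as a number.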

Resources

These examples just scratch the surface of the power of the ggplot2 package. As you continue to build your R skills, I encourage you to check out some of the abundant resources for learning ggplot2!

For example, you might check out the Introduction to Data Visualization with ggplot2 course on DataCamp.