Objectives

  • Practice R coding
  • Numerical and graphical data exploration
  • Very basic R plotting
  • R functions:
    • head()
    • mean()
    • sd()
    • plot()
    • points()
    • curve()
    • data()

Instructions

Before class, read through the assignment.

In class as a group, work through the exercise.

Select a group member to compile your plots into your report.

Everyone in your group should run the code on their own computer.

Please work with your group members to work through any difficulties!

Active collaboration is the key to success with the in-class activities!

R’s built-in data sets

R comes with lots of built-in data sets for reference and learning.

Today, we will use some of the built-in data sets for some simple data exploration.

The ‘iris’ dataset

The iris data set is a classic set of measurements on floral characteristics of three Iris species.

Load the dataset into R’s memory using:

data(iris)

Take a look at the top 6 rows:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Calculate the mean and standard deviation of columns

You know from DataCamp that you can use the [] and $ operators to retrieve subsets of data frames.

To get the entries in the “Sepal.Width” column you can type:

iris$Sepal.Width
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
##  [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
##  [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
##  [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
##  [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
##  [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
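
The [] operator retrieves the same column. A minimal sketch for comparison; the variable names widths_dollar and widths_bracket are made up for this illustration:

```r
data(iris)

# Two ways to retrieve the same column as a numeric vector
widths_dollar  = iris$Sepal.Width
widths_bracket = iris[, "Sepal.Width"]

# Both vectors hold the same 150 values
identical(widths_dollar, widths_bracket)
## [1] TRUE
```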

Use the functions mean() and sd() to calculate the mean and standard deviation of vectors:

mean(iris$Sepal.Length)
## [1] 5.843333
sd(iris$Sepal.Width)
## [1] 0.4358663

Very simple scatterplot

R’s lowly base plot() function has a lot of uses. We’ll just practice the basics. To build a scatterplot of iris sepal width and length:

plot(x = iris$Sepal.Width, y = iris$Sepal.Length)

Where is the center of the data?

In this case, let’s think of the center as the point whose x-coordinate is the mean of iris sepal width and whose y-coordinate is the mean of sepal length, matching the axes of the scatterplot.

data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)
c(data_center_x, data_center_y)
## [1] 3.057333 5.843333

Add a point to an existing plot:

Use points() to add to an existing plot:

plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")

Can you draw a line that describes the data?

Maybe that one wasn’t so great. It should probably pass through the data’s center point:

How did I do that?

  • I created a custom function to plot a line through a given point.

Loading a custom function into R’s workspace

NOTE: This is a preview of more advanced R techniques that you’ll use later. For now, we won’t worry about the details of how it works.

One of the most powerful reasons to use R is that you can write your own functions to accomplish tasks that you need to do, but that would otherwise be difficult or tedious to accomplish with existing R functions.

I wrote a simple function to calculate the coordinates of points on a line, given a slope and the x- and y-coordinates of a known point.

The code looks like:

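line_point_slope = function(x, x1, y1, slope)
{
  # y-intercept of the line through (x1, y1) with the given slope
  get_y_intercept = 
    function(x1, y1, slope) 
      return(-(x1 * slope) + y1)
  
  # y-value of a line at x, given its y-intercept and slope
  linear = 
    function(x, yint, slope) 
      return(yint + x * slope)
  
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}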

If I paste the code for the function definition in the R console, I can use the function just like any other one I already know, like c() or mean().

Try running the code in your R session, then run the function with the arguments 2, 4, 4, -2.

Your RStudio session should look something like this:
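
In the console, the call and its result should look roughly like the sketch below. The function definition is repeated only so that the block runs on its own. With a slope of -2 through the known point (4, 4), the y-intercept is -(4 * -2) + 4 = 12, so at x = 2 the y-value is 12 + 2 * -2 = 8.

```r
# Same function as above, repeated so this block runs on its own
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept = 
    function(x1, y1, slope) 
      return(-(x1 * slope) + y1)
  
  linear = 
    function(x, yint, slope) 
      return(yint + x * slope)
  
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

# x = 2, known point (4, 4), slope = -2
line_point_slope(2, 4, 4, -2)
## [1] 8

# At the known point's own x-value, the function returns the known y-value
line_point_slope(4, 4, 4, -2)
## [1] 4
```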

Now I can use this function to calculate the y-value along any line for which I know the slope and the coordinates of a single point.

(This will seem more impressive and useful when we talk about deterministic functions…)

Add a curve to an existing plot

You added a point to an existing plot using points().

You can use curve() to draw a line (or other curve).

Here’s the syntax to re-create the iris plot with a line passing through the center:

plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")
curve(
  line_point_slope(
    x, 
    data_center_x, 
    data_center_y,
    -0.1), 
  add = TRUE)

Note that the four arguments to line_point_slope() are:

  1. the x-value for which to calculate the y-value output.
  2. the x-coordinate of the known point along the line.
  3. the y-coordinate of the known point along the line.
  4. the slope of the line.

Try to modify the code I ran above to re-draw the plot with a different slope. Something like this:
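
One possible version is sketched below. The slope of 0.5 and the title text are arbitrary placeholders, not the values used in class. The center variables and the function are re-created at the top only so that the block runs on its own; the one-line line_point_slope() here is an algebraically equivalent shortcut, not the original definition.

```r
data(iris)

# Re-created here so the block runs on its own
data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)

# Compact equivalent of line_point_slope() as defined above
line_point_slope = function(x, x1, y1, slope)
  y1 + slope * (x - x1)

# The slope (0.5) and the title text are placeholders; choose your own
plot(
  x = iris$Sepal.Width, 
  y = iris$Sepal.Length,
  main = "Iris sepal length vs. width, slope = 0.5")
points(x = data_center_x, y = data_center_y, col = "red")
curve(
  line_point_slope(
    x, 
    data_center_x, 
    data_center_y,
    0.5), 
  add = TRUE)
```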

Self check:

  • Where did the data_center_x and data_center_y variables come from?
  • Why do you think I used these variables rather than hard coded numeric constants?
  • What part of the code added the descriptive title to the plot?

Try other variables or datasets.

Try to re-use and modify the code above to work with other datasets. Experiment with plotting other lines to graphically describe the data.

  • The iris dataset contains other floral measurements, such as Petal.Length and Petal.Width.
  • Try the CO2 dataset
  • Load the MASS package and try out the Animals dataset:
library(MASS)
data(Animals)
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0
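
As one possible starting point, the iris steps can be repeated on the CO2 dataset, which ships with R. This sketch plots the uptake column against conc. The variable names co2_center_x and co2_center_y, the slope of 0.02, and the title are arbitrary choices for illustration, and the line is not a fitted model.

```r
data(CO2)

# Center of the data: the mean of each plotted column
co2_center_x = mean(CO2$conc)
co2_center_y = mean(CO2$uptake)

# Compact equivalent of line_point_slope() from earlier in the exercise,
# repeated so this block runs on its own
line_point_slope = function(x, x1, y1, slope)
  y1 + slope * (x - x1)

# The slope (0.02) and the title are placeholders; try other values
plot(x = CO2$conc, y = CO2$uptake, main = "CO2 uptake vs. concentration")
points(x = co2_center_x, y = co2_center_y, col = "red")
curve(
  line_point_slope(x, co2_center_x, co2_center_y, 0.02),
  add = TRUE)
```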

Instructions

  1. Self-select your group
  2. One person should create a Word document on OneDrive and share it with the other group members (giving them edit privileges).
  3. Proceed as far as you can in the exercise during class, ideally through the line-fitting.
  4. Actively work together to keep everybody at the same pace. I should hear lots of talking and see a lot of looking at each other’s computers!
  5. Everybody should try out different values for the slope and intercept (it doesn’t matter if the lines fit the data well - for now).
  6. Practice making descriptive titles.
  7. Each group member should contribute a scatterplot with a line and a customized title containing your name to the report document.

Report

Submit a document containing a scatterplot from each member of your in-class group. Each scatterplot should have the group member’s name in the title.

  • In addition to the scatterplots, include a description of which parts of the exercise were difficult, or too easy.

Rubric

  • (4 pts) A scatterplot from each group member.
  • (1 pt) Description of what was easy and difficult.