head()
mean()
sd()
plot()
points()
curve()
data()
Before class, read through the assignment.
In class as a group, work through the
Select a group member to compile your plots into your report.
Everyone in your group should run the code on their own computer.
Please work with your group members to work through any difficulties!
Active collaboration is the key to success with the in-class activities!
R comes with lots of built-in data sets for reference and learning.
Today, we will use some of the built-in data sets for some simple data exploration.
The iris data set is a classic set of measurements on floral characteristics of three Iris species.
Load the dataset into R’s memory using:
data(iris)
Take a look at the top 6 rows:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
You know from DataCamp that you can use the [] and $ operators to retrieve subsets of data frames.
To get the entries in the “Sepal.Width” column you can type:
iris$Sepal.Width
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
## [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
## [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
## [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
## [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
## [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
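The [] operator can retrieve the same column. The following is a minimal sketch, assuming the iris data frame is loaded; the name width_bracket is made up for this illustration:

```r
data(iris)

# [rows, columns]: leaving the rows slot blank keeps every row
width_bracket = iris[, "Sepal.Width"]

# The result matches what the $ operator returns
identical(width_bracket, iris$Sepal.Width)

# Rows 1 to 3 of two columns, selected by name
iris[1:3, c("Sepal.Width", "Species")]
```

The $ form is shorter when you want one whole column; the [] form is handy when you also want to pick rows or several columns at once.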
Use the functions mean() and sd() to calculate the mean and standard deviation of vectors:
mean(iris$Sepal.Length)
## [1] 5.843333
sd(iris$Sepal.Width)
## [1] 0.4358663
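To see what these two functions compute, here is a rough sketch that rebuilds them by hand. The names x, n, manual_mean, and manual_sd are made up for this illustration. Note that sd() divides by n - 1, not n:

```r
data(iris)
x = iris$Sepal.Width
n = length(x)

# Mean: the sum of the values divided by how many there are
manual_mean = sum(x) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd = sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both comparisons should report TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```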
R’s lowly base plot() function has a lot of uses. We’ll just practice the basics. To build a scatterplot of iris sepal width and length:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
Where is the center of the data?
In this case, let’s think of the center as the point whose x-coordinate is the mean of iris sepal width and whose y-coordinate is the mean of sepal length, matching the axes of the scatterplot.
data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)
c(data_center_x, data_center_y)
## [1] 3.057333 5.843333
Use points() to add a point to an existing plot:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")
Maybe that one wasn’t so great. It should probably pass through the data’s center point:
How did I do that?
NOTE: This is a preview of more advanced R techniques that you’ll use later. For now, we won’t worry about the details of how it works.
One of the most powerful reasons to use R is that you can write your own functions to accomplish tasks that you need to do, but that would otherwise be difficult or tedious to accomplish with existing R functions.
I wrote a simple function to calculate the coordinates of points on a line, given the x- and y-coordinates of a known point and a slope.
The code looks like:
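line_point_slope = function(x, x1, y1, slope)
{
  # y-intercept of the line through (x1, y1) with the given slope
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  # y-value at x for a line with intercept yint and the given slope
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}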
If I paste the code for the function definition in the R console, I can use the function just like any other one I already know, like c() or mean().
Try running the code in your R session then running the function with the arguments: 2, 4, 4, -2
Your RStudio session should look something like this:
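As a sketch of that call in code, the block below repeats the function definition so that it runs on its own. The comments trace the arithmetic by hand; the variable name y_value is made up for this illustration:

```r
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

# Arguments in order: x = 2, x1 = 4, y1 = 4, slope = -2
# y-intercept: -(4 * -2) + 4 = 12
# y at x = 2:  12 + 2 * -2  = 8
y_value = line_point_slope(2, 4, 4, -2)
y_value
## [1] 8
```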
Now I can use this function to calculate the y-value along any line for which I know the slope and the coordinates of a single point.
(This will seem more impressive and useful when we talk about deterministic functions…)
You added a point to an existing plot using points(). You can use curve() to draw a line (or other curve).
Here’s the syntax to re-create the iris plot with a line passing through the center:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")
curve(
  line_point_slope(
    x,
    data_center_x,
    data_center_y,
    -0.1),
  add = TRUE)
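One way to convince yourself that the line really passes through the center is to evaluate the function at the center's x-coordinate and compare the result with the center's y-coordinate. This is only a sketch: the function definition is repeated so the block runs on its own, and the name y_at_center is made up for this illustration:

```r
data(iris)

line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)

# The line's y-value at x = data_center_x ...
y_at_center = line_point_slope(
  data_center_x, data_center_x, data_center_y, -0.1)

# ... should equal data_center_y (up to floating-point rounding)
all.equal(y_at_center, data_center_y)
```

The same check should hold for any slope you choose, because the known point is on the line by construction.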
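Note that the four arguments to line_point_slope() are:
x: the x-values at which curve() evaluates the function
data_center_x: the x-coordinate of the known point
data_center_y: the y-coordinate of the known point
-0.1: the slope of the line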
Try to modify the code I ran above to re-draw the plot with a different slope. Something like this:
Self check:
Where do the data_center_x and data_center_y variables come from?
Try to re-use and modify the code above to work with other datasets. Experiment with plotting other lines to graphically describe the data.
The built-in CO2 dataset
The MASS package: load it and try out the Animals dataset:
library(MASS)
data(Animals)
head(Animals)
## body brain
## Mountain beaver 1.35 8.1
## Cow 465.00 423.0
## Grey wolf 36.33 119.5
## Goat 27.66 115.0
## Guinea pig 1.04 5.5
## Dipliodocus 11700.00 50.0
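Here is one possible way to adapt the iris workflow to the Animals data, offered as a starting sketch and not as the expected answer. Two choices in it are assumptions made only for this illustration: plotting log10-transformed values (the raw body sizes span several orders of magnitude) and the slope of 0.5, which is an arbitrary guess. The names log_body, log_brain, center_x, and center_y are also made up:

```r
library(MASS)
data(Animals)

# Same point-slope function, repeated so this block runs on its own
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

# Assumption: work on a log10 scale so small animals are visible
log_body = log10(Animals$body)
log_brain = log10(Animals$brain)

# Center of the (log-transformed) data
center_x = mean(log_body)
center_y = mean(log_brain)

plot(x = log_body, y = log_brain)
points(x = center_x, y = center_y, col = "red")

# Assumption: 0.5 is an arbitrary slope; try other values
curve(
  line_point_slope(x, center_x, center_y, 0.5),
  add = TRUE)
```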
Submit a document containing a scatterplot from each member of your in-class group. Each scatterplot should have the group member’s name in the title.
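A plot title can be set with the main argument of plot(). The sketch below assumes the iris scatterplot; "Your Name" is a placeholder, and the variable name plot_title and the axis labels are made up for this illustration:

```r
data(iris)

# Placeholder: replace "Your Name" with the group member's name
plot_title = paste("Iris sepals:", "Your Name")

plot(
  x = iris$Sepal.Width,
  y = iris$Sepal.Length,
  main = plot_title,
  xlab = "Sepal width",
  ylab = "Sepal length")
```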