head()
mean()
sd()
plot()
points()
curve()
data()
As a group, work through the following sections.
Everyone in your group should run the code on their own computer. Please work with your group members to work through any difficulties!
R comes with lots of built-in data sets for reference and learning.
Today, we will use some of the built-in data sets for some simple data exploration.
The iris data set is a classic set of measurements on floral characteristics of three Iris species.
Load the dataset into R’s memory using:
data(iris)
Take a look at the top 6 rows:
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
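If you want a slightly different first look at the data, here is a small sketch of a few related base R functions. The choice of 3 rows is arbitrary, just for illustration.

```r
data(iris)

# head() takes an optional n argument for the number of rows to show
head(iris, n = 3)

# tail() shows the last rows instead
tail(iris, n = 3)

# dim() gives the number of rows and columns
dim(iris)

# str() summarizes each column's type and its first few values
str(iris)
```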
You know from DataCamp that you can use the [] and $ operators to retrieve subsets of data frames.
To get the entries in the “Sepal.Width” column you can type:
iris$Sepal.Width
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
## [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
## [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
## [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
## [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
## [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
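As a reminder of how [] and $ relate, here is a short sketch. The variable names (width_dollar and so on) are made up for this example.

```r
data(iris)

# Three ways to get the same column as a vector
width_dollar  = iris$Sepal.Width
width_bracket = iris[, "Sepal.Width"]
width_double  = iris[["Sepal.Width"]]
identical(width_dollar, width_bracket)
identical(width_dollar, width_double)

# Rows 1 to 3 of two columns: [rows, columns]
iris[1:3, c("Sepal.Width", "Species")]

# Sepal widths for one species only, using a logical condition
setosa_width = iris$Sepal.Width[iris$Species == "setosa"]
length(setosa_width)
```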
Use the functions mean() and sd() to calculate the mean and standard deviation of vectors:
mean(iris$Sepal.Length)
## [1] 5.843333
sd(iris$Sepal.Width)
## [1] 0.4358663
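If you are curious what mean() and sd() are doing, the sketch below computes both by hand and compares the results. The names manual_mean and manual_sd are made up for this example. Note that sd() is the sample standard deviation, so it divides by n - 1.

```r
data(iris)
x = iris$Sepal.Width
n = length(x)

# Mean: sum of the values divided by how many there are
manual_mean = sum(x) / n

# Sample standard deviation: note the n - 1 in the denominator
manual_sd = sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both comparisons should print TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```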
R’s lowly base plot() function has a lot of uses. We’ll just practice the basics. To build a scatterplot of iris sepal width and length:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
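The default axis labels are just the code you typed. Here is a sketch of the same plot with friendlier labels. The label and title text are my own suggestions; the iris measurements are in centimeters.

```r
data(iris)

# xlab, ylab, and main set the axis labels and the plot title
plot(
  x = iris$Sepal.Width,
  y = iris$Sepal.Length,
  xlab = "Sepal width (cm)",
  ylab = "Sepal length (cm)",
  main = "Iris sepal dimensions")
```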
Where is the center of the data?
In this case, let’s think of the center as the point whose x-coordinate is the mean of iris sepal width and whose y-coordinate is the mean of sepal length.
data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)
c(data_center_x, data_center_y)
## [1] 3.057333 5.843333
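Another possible way to get both coordinates at once is colMeans(), which takes the mean of each column of a data frame. This is just an alternative sketch; the name data_center is made up for this example.

```r
data(iris)

# Select the two columns in (x, y) order, then take the mean of each
data_center = colMeans(iris[, c("Sepal.Width", "Sepal.Length")])
data_center

# The result is a named vector, so each coordinate can be pulled out by name
data_center["Sepal.Width"]
data_center["Sepal.Length"]
```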
Use points() to add to an existing plot:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")
Maybe that one wasn’t so great. It should probably pass through the data’s center point:
How did I do that?
NOTE: This is a preview of more advanced R techniques that you’ll use later. For now, we won’t worry about the details of how it works.
One of the most powerful reasons to use R is that you can write your own functions to accomplish tasks that you need to do, but that would otherwise be difficult or tedious to accomplish with existing R functions.
I wrote a simple function to calculate the coordinates of points on a line, given the x- and y-coordinates of a known point and a slope.
The code looks like:
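line_point_slope = function(x, x1, y1, slope)
{
  # y-intercept of the line through (x1, y1) with the given slope
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  # y-value(s) of the line at x, given its intercept and slope
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}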
If I paste the code for the function definition in the R console, I can use the function just like any other one I already know, like c() or mean().
Try running the code in your R session then running the function with the arguments: 2, 4, 4, -2
Your RStudio session should look something like this:
Now I can use this function to calculate the y-value along any line for which I know the slope and the coordinates of a single point.
(This will seem more impressive and useful when we talk about deterministic functions…)
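As a rough check on the function, the sketch below runs it with the suggested arguments and two other inputs. With x = 2, a known point at (4, 4), and a slope of -2, the y-intercept works out to 12, so the first call should return 8.

```r
# The function definition from above, repeated so this block runs on its own
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

# The suggested arguments: x = 2, known point (4, 4), slope = -2
line_point_slope(2, 4, 4, -2)

# At x = x1 the function should return y1, whatever the slope
line_point_slope(4, 4, 4, -2)

# x can also be a vector of several x-values
line_point_slope(c(0, 2, 4), 4, 4, -2)
```

The last call returns one y-value per x-value, which is what a plotting function needs in order to draw the whole line.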
You added a point to an existing plot using points(). You can use curve() to draw a line (or other curve).
Here’s the syntax to re-create the iris plot with a line passing through the center:
plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")
curve(
  line_point_slope(
    x,
    data_center_x,
    data_center_y,
    -0.1),
  add = TRUE)
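Note that the four arguments to line_point_slope() are: x (the x-values, which curve() supplies), x1 and y1 (the coordinates of the known point, here data_center_x and data_center_y), and slope (here -0.1).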
Try to modify the code I ran above to re-draw the plot with a different slope. Something like this:
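Here is one way the modified code might look. The slope of 0.5 and the blue color are arbitrary choices for illustration, so feel free to pick your own.

```r
# The function definition from above, repeated so this block runs on its own
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept =
    function(x1, y1, slope)
      return(-(x1 * slope) + y1)
  linear =
    function(x, yint, slope)
      return(yint + x * slope)
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

data(iris)
data_center_x = mean(iris$Sepal.Width)
data_center_y = mean(iris$Sepal.Length)

plot(x = iris$Sepal.Width, y = iris$Sepal.Length)
points(x = data_center_x, y = data_center_y, col = "red")

# Same call as before, with only the slope (and the line color) changed
curve(
  line_point_slope(
    x,
    data_center_x,
    data_center_y,
    0.5),
  add = TRUE,
  col = "blue")
```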
Self check:
Where do the data_center_x and data_center_y variables come from?
Try to re-use and modify the code above to work with other datasets. Experiment with plotting other lines to graphically describe the data.
The CO2 dataset
Load the MASS package and try out the Animals dataset:
library(MASS)
data(Animals)
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0
Submit a document containing a scatterplot from each member of your in-class group. Each scatterplot should have the group member’s name in the title.
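As a starting point, here is a rough sketch using the built-in CO2 dataset. The choice of the conc and uptake columns, the variable names co2_center_x and co2_center_y, and the title text (including the "Your Name" placeholder) are all just for illustration.

```r
data(CO2)

# Center of the data: mean of each plotted variable
co2_center_x = mean(CO2$conc)
co2_center_y = mean(CO2$uptake)

# The main argument sets the plot title; replace "Your Name" with your own
plot(
  x = CO2$conc,
  y = CO2$uptake,
  main = "CO2 uptake vs. concentration (Your Name)")
points(x = co2_center_x, y = co2_center_y, col = "red")
```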