The reading for this lab is long and winding, but it covers some important concepts you’ll need to know to be a wise R coder.
You’ll need to become very familiar with logical tests and operators in R - a field also known as Boolean algebra.
The symbols for the most common logical tests we use in R are:
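- == : equal to
- > and < : greater than, less than
- >= and <= : greater than or equal to, less than or equal to
- != : not equal to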
Some important logical operators are:
- ! (logical not): changes a TRUE to a FALSE and vice versa.
- & (logical and): TRUE only if both elements are TRUE.
- | (logical or): TRUE if at least one of the elements is TRUE, and FALSE when both of the test elements are FALSE.
I encourage you to play with these operators and tests to get an intuitive feel for what they do.
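Here is a small sketch you could paste into the console as a starting point for that kind of play. The vector names (x, y, and values) are made up for this example; change the entries and see how the results change.

```r
# Two small logical vectors to experiment with
x = c(TRUE, TRUE, FALSE, FALSE)
y = c(TRUE, FALSE, TRUE, FALSE)

not_x = !x        # flips every element of x
x_and_y = x & y   # TRUE only where both x and y are TRUE
x_or_y = x | y    # TRUE where at least one of x or y is TRUE

# Logical tests also work element by element
values = c(1, 5, 10)
big_enough = values >= 5
not_five = values != 5

print(not_x)
print(x_and_y)
print(x_or_y)
print(big_enough)
print(not_five)
```

Notice that every result has one entry for each pair of elements that was compared.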
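There are two related operators: && and ||. These are not ‘vectorized’: in versions of R before 4.3.0 they evaluate only the first element of the objects they are comparing, and from R 4.3.0 onward they produce an error if either object has more than one element.
To avoid confusion and unexpected results we will avoid using && and || in this course.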
To illustrate the difference, we can compare elements in some vectors.
a = c(TRUE, FALSE, FALSE)
b = c(TRUE, TRUE, FALSE)
c = c(TRUE, TRUE, FALSE, TRUE)
a & b
## [1] TRUE FALSE FALSE
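The next expression produces a warning (not an error) since a and c are of different lengths. R recycles the elements of the shorter vector to match the longer one.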
a & c
## [1] TRUE FALSE FALSE TRUE
a && b
## [1] TRUE
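In versions of R before 4.3.0 this is valid R code since && only compares the first elements. From R 4.3.0 onward, both a && b and a && c produce an error instead, because the vectors have more than one element.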
a && c
## [1] TRUE
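If what you really want is a single TRUE or FALSE that summarizes a whole vector, one option is to combine the vectorized & with all() or any(). These two functions are not covered in the text above, so treat this as an optional sketch rather than the course's recommended approach.

```r
a = c(TRUE, FALSE, FALSE)
b = c(TRUE, TRUE, FALSE)

elementwise = a & b          # one result per pair of elements

all_true = all(elementwise)  # TRUE only if every element is TRUE
any_true = any(elementwise)  # TRUE if at least one element is TRUE

print(elementwise)
print(all_true)
print(any_true)
```

This keeps the comparison explicit: you can see the element-by-element result first, and then decide how to collapse it to a single value.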
What kind of number is 4? What about 4.0?
Is 1.0 equivalent to TRUE?
Is 3.0000000000000001 the same as 3.0?
Let’s ask R:
class(4)
## [1] "numeric"
class(4.0)
## [1] "numeric"
1.0 == TRUE
## [1] TRUE
3.0000000000000001 == 3.0
## [1] TRUE
Were those the answers you expected?
Let’s see if these make any more sense:
(0 + 1) == TRUE
## [1] TRUE
(0 - 1) == TRUE
## [1] FALSE
Can I multiply or divide by TRUE or FALSE?
3.0 * TRUE
## [1] 3
4 / FALSE
## [1] Inf
FALSE / FALSE
## [1] NaN
3.0 * (TRUE + FALSE)
## [1] 3
3.0 * (TRUE - FALSE)
## [1] 3
3.0 * (FALSE - FALSE)
## [1] 0
My head hurts a little bit after writing all of those!
How might we make sense of those results?
Why am I even asking you to know about such nonsensical things?
All of the above weirdness is related to how R implements the concepts of data typing, coercion, and type casting.
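To make coercion a little more concrete, here is a short sketch using as.numeric(), as.logical(), and identical(). The variable names are made up for this example. It suggests why the results above came out the way they did: R quietly converts TRUE to 1 and FALSE to 0 before doing arithmetic, and it cannot store more digits than a double-precision number allows.

```r
# Logical values are coerced to numbers when used in arithmetic
true_as_number = as.numeric(TRUE)
false_as_number = as.numeric(FALSE)

# TRUE + TRUE is ordinary addition after coercion
two = TRUE + TRUE

# Coercion can also go the other way: 0 becomes FALSE, other numbers TRUE
zero_as_logical = as.logical(0)
five_as_logical = as.logical(5)

# 3.0000000000000001 has more digits than a double can store,
# so the stored value is exactly 3
stored_value = 3.0000000000000001
same_as_three = identical(stored_value, 3)

print(true_as_number)
print(false_as_number)
print(two)
print(zero_as_logical)
print(five_as_logical)
print(same_as_three)
```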
This brings us back to the idea of high- and low-level programming languages.
The classification of a programming language as low- or high-level is related to the level of abstraction between what the programmer writes and what the computer executes.
In other words, if a language does a lot of translating between the code you write and the instructions that a computer can understand, it is a high-level language.
In a high-level language (such as R), you might be able to calculate the sum of the elements of a matrix with a single command:
my_matrix = matrix(data = 1:9, nrow = 3, ncol = 3)
sum(my_matrix)
## [1] 45
Behind the scenes, R knows that it has to look at the elements of my_matrix and keep a running total as it adds all of the values together. In this case, there is a high level of abstraction between what you type:
sum(my_matrix)
and what the computer actually does.
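To get a feel for the running-total idea, here is a sketch that spells out the steps by hand in R using a for-loop (loops are covered later in this reading). This is only a conceptual illustration: the real sum() is implemented in compiled code, not as an R loop, and running_total and value are names I made up for the example.

```r
my_matrix = matrix(data = 1:9, nrow = 3, ncol = 3)

# Keep a running total while visiting every element of the matrix
running_total = 0
for (value in my_matrix)
{
  running_total = running_total + value
}

print(running_total)

# The hand-written version agrees with the built-in function
print(running_total == sum(my_matrix))
```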
In contrast, if you wanted to do the same task in Java, which is a “moderately high-level” language, you might have to write something like this (please don’t worry about trying to understand all of the code):
public class MatrixSumDemo
{
    // Adds up every element of a two-dimensional integer array.
    public static int matrix_sum(int[][] input_matrix)
    {
        int running_total = 0;
        // i indexes the rows; j indexes the columns within row i
        for (int i = 0; i < input_matrix.length; i++)
            for (int j = 0; j < input_matrix[i].length; j++)
            {
                running_total += input_matrix[i][j];
            }
        return running_total;
    }
    public static void main(String[] args)
    {
        int[][] my_matrix = new int[3][3];
        my_matrix[0] = new int[] {1, 2, 3};
        my_matrix[1] = new int[] {4, 5, 6};
        my_matrix[2] = new int[] {7, 8, 9};
        System.out.println(matrix_sum(my_matrix));
    }
}
## 45
You might notice that I didn’t call Java a low-level language! This means that even though you have to spell out more of the steps in Java (compared to R), there are still many layers of abstraction between the Java code and the instructions your computer processor can understand. Imagine if you had to write a program in binary that you could submit directly to your computer’s processor to evaluate…
There are some serious trade-offs we should be aware of when we use high-level languages.
In the R/Java example you might have noticed that the text int occurred a lot in the Java code. That’s because Java requires us to specify exactly what kind of number we want a variable to hold (int specifies that a number is an integer).
R tries to guess what kind of number we are using so we didn’t have to tell it that we wanted our matrix to be filled with integer values.
Usually that’s ok, but what happens when we try to multiply TRUE by 3?
Java wouldn’t even let you create a program in which such an operation was possible! That’s a safeguard against having unpredictable or undefined behavior.
In R, however, TRUE * 3 is perfectly legal, even if it’s not clear what it means or why we would want to do such a thing.
If a friend asked you what you get when you divide TRUE by 5, how would you respond?
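If you are curious about what R guessed, the typeof() function reports how a value is stored. The sketch below is just one way to peek at those guesses. The variable names are made up for this example.

```r
# typeof() reports how R stores a value internally
type_of_sequence = typeof(1:9)        # the matrix above was built from 1:9
type_of_plain_number = typeof(4)      # a number typed without a suffix
type_of_integer_literal = typeof(4L)  # the L suffix requests an integer
type_of_logical = typeof(TRUE)
type_of_product = typeof(TRUE * 3)    # TRUE is coerced before multiplying

print(type_of_sequence)
print(type_of_plain_number)
print(type_of_integer_literal)
print(type_of_logical)
print(type_of_product)
```

You never told R which types to use. It chose them for you, which is convenient but means you sometimes need to check.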
We’ve mostly encountered R calls that use a single function.
For more complicated or sophisticated tasks, we often have to combine numerous functions.
Suppose I wanted to print the value of a randomly-generated integer.
I could:
int_rnd = sample(100, 1)
int_rnd_sentence = paste0("The value of the randomly-generated number is: ", int_rnd)
print(int_rnd_sentence)
## [1] "The value of the randomly-generated number is: 75"
Or I could nest all of those tasks within a single function without creating the intermediate variables:
print(
  paste0(
    "The value of the randomly-generated number is: ",
    sample(100, 1)))
## [1] "The value of the randomly-generated number is: 46"
Note that sample() was called within paste0(), which was called within print().
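One way to convince yourself that the stepwise and nested versions do the same work is to fix the random number generator's starting point with set.seed() before each one. set.seed() is not used in the text above, and the seed value 42 is an arbitrary choice for this sketch.

```r
# Stepwise version, with intermediate variables
set.seed(42)
int_rnd = sample(100, 1)
stepwise_sentence = paste0("The value of the randomly-generated number is: ", int_rnd)

# Nested version: R evaluates the innermost call, sample(), first
set.seed(42)
nested_sentence = paste0(
  "The value of the randomly-generated number is: ",
  sample(100, 1))

# With the same seed, both approaches build the same sentence
print(identical(stepwise_sentence, nested_sentence))
```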
Here is a simple for-loop in R:
for (i in 1:10)
{
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
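Key items to note:
- The for (... syntax lets R know that it will execute a for-loop.
- Within the for syntax, the i in 1:10 tells R to execute the loop 10 times, using an index variable called i.
- The values of i come from the expression 1:10. What does this expression do on its own?
- The index variable and the sequence are separated by the keyword in.
- The body of the loop is enclosed in curly braces {}.
- I used print() to print the value during each pass through the loop.
- The body of this loop contains a single line of code (print(i)).
We’ll look at other kinds of loops later. For now I encourage you to play with this loop skeleton to make sure you understand the syntax.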
R also has a family of functions, the apply functions, that can accomplish the same tasks as a loop.
Some folks prefer to only use the apply approach, while others prefer to only use the loop approach.
Loops in R tend to be slow compared to loops in many other languages.
There is also a school of thought that says the apply approach is more elegant or aesthetically appealing.
My opinion is that whether or not you choose to use or avoid loops in R, you need to know what loops are and how they work. Loops are a fundamental concept in computing, within and beyond the R world.
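To give a flavor of the two styles, here is a sketch that squares the numbers 1 through 5 both ways. I picked sapply() as a representative member of the apply family. The task and the variable names are made up for this example.

```r
numbers = 1:5

# Loop version: fill in a results vector one element at a time
squares_loop = numeric(length(numbers))
for (i in 1:length(numbers))
{
  squares_loop[i] = numbers[i]^2
}

# Apply version: sapply() calls the function once for each element
squares_apply = sapply(numbers, function(x) x^2)

print(squares_loop)
print(squares_apply)
```

Both versions produce the same five values. Which one reads more clearly is largely a matter of taste.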
You used a custom function in the in-class activity on Tuesday.
Here’s a very simple custom function:
print_number = function(n)
{
  print(paste0("The value of the number is ", n))
}
print_number(145)
## [1] "The value of the number is 145"
Things to notice:
function()
to define a new
function.function()
become the
arguments to the new function you want to create. In this
example there is only argument: n
.{}
n
within the
body of the function takes on the value that you supplied:print_number(145)
causes the variable
n
within the function to take the value 145.Argument: an input to a function. A function can have zero or more arguments.
In R, arguments can have default values.
Arguments have names.
R expects the arguments to be supplied in the order specified by the function definition. Unless…
You can input the arguments in any order if you specify their names.
Function body: the code that is called inside the function.
The tasks that the function performs are all written within the function body.
The function body is written within curly braces
{}
.
Function bodies can contain many lines of code.
Return value: the value that a function produces.
Functions do not have to have a return value.
The print_number()
function above does not have a
return value.
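Here is a sketch that puts those three terms together. The function rectangle_area() and its arguments are hypothetical, invented only for this illustration.

```r
# rectangle_area() is a made-up function for illustration only
rectangle_area = function(width, height = 1)
{
  # Function body: the work happens here
  area = width * height
  # Return value: what the function hands back to the caller
  return(area)
}

area_default = rectangle_area(3)                      # height uses its default value, 1
area_by_position = rectangle_area(3, 4)               # arguments matched by their order
area_by_name = rectangle_area(height = 4, width = 3)  # any order is fine when named

print(area_default)
print(area_by_position)
print(area_by_name)
```

The last two calls give the same answer because naming the arguments removes any dependence on their order.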
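All of the following calls to rnorm() are equivalent: each asks for 10 draws from a normal distribution with mean 0 and standard deviation 1 (the random values themselves will differ from call to call).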
rnorm(10)
rnorm(n = 10, sd = 1)
rnorm(sd = 1, mean = 0, n = 10)
If you consult the R help entry for rnorm() you will see that the 3 arguments are (in order):
- n
- mean
- sd
Both mean and sd have default values (0 and 1, respectively).
Check out the R help entry for rnorm() by typing ?rnorm into the console window:
?rnorm
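Because rnorm() produces random values, it is hard to tell by eye that the three calls above request the same thing. One way to check is to reset the random number generator with set.seed() before each call. set.seed() does not appear in the text above, and the seed value 123 is an arbitrary choice for this sketch.

```r
set.seed(123)
draws_positional = rnorm(10)

set.seed(123)
draws_named = rnorm(n = 10, sd = 1)

set.seed(123)
draws_reordered = rnorm(sd = 1, mean = 0, n = 10)

# With the same seed, all three calls return exactly the same values
print(identical(draws_positional, draws_named))
print(identical(draws_named, draws_reordered))
```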
You’ve used logical subsetting to select elements of matrices and vectors. With small data sets it’s possible to look at all of the elements at once and visually detect the indices of the elements you want. This is not possible with larger data sets.
Run the following code to create a large vector containing randomly generated integers between 1 and 12:
n = 12345
vec_1 = sample(12, n, replace = TRUE)
head(vec_1)
Use a logical test operator to create a Boolean vector (called vec_2) whose entries are TRUE if the corresponding entry in vec_1 is 3 and FALSE otherwise.
Self test: you can use vec_2 to retrieve all of the 3 elements of vec_1 using the following:
vec_1[vec_2]
You should see a vector whose elements are all 3.
- Provide the code you used to create vec_2. Your code should be a complete and self-contained example. I should be able to paste your code into a fresh R session on my computer and re-create your vec_2.
- Explain why determining which elements in vec_1 have value 3 by visual inspection is a bad idea.
Run the following code to create a large vector containing randomly generated integers between 1 and 12:
n = 12345
vec_1 = sample(12, n, replace = TRUE)
head(vec_1)
Use the function length() to determine how many elements are in vec_1.
Now, run the following line to check how many entries have the value 3:
sum(vec_1 == 3)
Finally, run the following code several times, taking note of how many 3 entries appear each time you run it.
n = 10
vec_1 = sample(12, n, replace = TRUE)
paste0("Sum of elements with value 3: ", sum(vec_1 == 3))
- [...] 3 entries each time?
- [...] 3.
You may want to review the for-loop example in the lab walkthrough.
for (i in 1:10)
{
  print(i)
}
Modify the code in the body of the loop to print out a message like “This is loop iteration: 1” for each run through the loop.
- Hint: use the print() and [paste() or paste0()] functions.
You may want to review the for-loop example in the lab walkthrough.
Run the following code on your computer:
for (i in 1:10)
{
  print(i)
}
Note that the loop runs through exactly 10 iterations…
What if you wanted the loop to execute an arbitrary number of times?
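- Create a variable, n, that contains an integer value.
- Modify the loop so that it runs n times.
- Provide the code for the modified loop that runs n times. It needs to be a self-contained example. I should be able to set the value of n and then run your loop on my computer.
- Create a variable, n, that holds the value 17.
- Create a vector called vec_1 of length n. vec_1 should contain [pseudo]randomly generated integers between 1 and 10.
- Hint: use the sample() R function.
Now, create a loop that:
- Runs n times (once for each element of vec_1).
- Prints the index and the value of the corresponding element of vec_1.
Your output should look something like this: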
## The element of vec_1 at index 1 is 4.
## The element of vec_1 at index 2 is 10.
## The element of vec_1 at index 3 is 3.
## The element of vec_1 at index 4 is 2.
## The element of vec_1 at index 5 is 2.
## The element of vec_1 at index 6 is 9.
Hint: Think of what code you’ll need to include within the body of the loop to display:
1. The index number
2. The value of vec_1 at the index
- Provide the code you used to create n, vec_1, and the loop. As always, it should run as a stand-alone example in a fresh R session on my computer.
Write a function create_and_print_vec().
Your function should take three integer arguments, n, min, and max.
- n has no default value.
- min has a default value of 1.
- max has a default value of 10.
Your function needs to do the following:
- Create a vector of n random integers between the values of min and max.
Here’s a skeleton:
create_and_print_vec = function(n, min = , max =)
{
# Function body goes here
}
- Hint: you can use sample(). Look up the R help entry to see what other kinds of values the x argument of sample() can accept.
Your function should create output like this with default values for min and max:
## [1] "The element at index 1 is 3"
## [1] "The element at index 2 is 3"
## [1] "The element at index 3 is 3"
## [1] "The element at index 4 is 3"
## [1] "The element at index 5 is 2"
## [1] "The element at index 6 is 2"
## [1] "The element at index 7 is 2"
## [1] "The element at index 8 is 3"
## [1] "The element at index 9 is 2"
## [1] "The element at index 10 is 2"
## [1] "The element at index 11 is 3"
## [1] "The element at index 12 is 3"
## [1] "The element at index 13 is 3"
## [1] "The element at index 14 is 2"
## [1] "The element at index 15 is 2"
## [1] "The element at index 16 is 3"
## [1] "The element at index 17 is 2"
## [1] "The element at index 18 is 2"
## [1] "The element at index 19 is 3"
## [1] "The element at index 20 is 2"
When you use min = 100 and max = 2000 your output should resemble:
create_and_print_vec(10, min = 100, max = 2000)
## [1] "The element at index 1 is 1600"
## [1] "The element at index 2 is 289"
## [1] "The element at index 3 is 1818"
## [1] "The element at index 4 is 1999"
## [1] "The element at index 5 is 1963"
## [1] "The element at index 6 is 1643"
## [1] "The element at index 7 is 382"
## [1] "The element at index 8 is 1802"
## [1] "The element at index 9 is 1950"
## [1] "The element at index 10 is 1484"
Compile your answers to all 9 questions into a PDF document and submit via Moodle.
Here is a collection of hints I’ve compiled based on student questions over the years. Perhaps your questions could contribute to new hints!
You may not need these to complete the lab, so don’t peek if you don’t want to!
However, if you are stuck you may find something useful here.
The sample() function
There is a lot of info in the help entry for sample(), but it can be difficult to understand R help entries, especially when you’re first starting with R!
When sample() has x equal to a single whole number, it draws random whole numbers from 1 up to that number.
Let’s try a simple example with 1’s and 2’s:
sample(x = 2, size = 1)
## [1] 1
Try running the code several times and look at the output.
Now, let’s use a variable:
n = 5
sample(x = n, size = 1)
## [1] 5
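If you would rather not run the same line over and over, one option is to draw many values at once and tabulate them. The sketch below uses table() and %in%, which are not discussed above. The values 6 and 1000 and the seed are arbitrary choices for this illustration.

```r
set.seed(1)

# Draw 1000 values, with replacement, from the whole numbers 1 through 6
draws = sample(x = 6, size = 1000, replace = TRUE)

# Count how many times each value appeared
print(table(draws))

# Every draw should fall between 1 and 6
print(all(draws %in% 1:6))
```

The table should show counts only for the values 1 through 6, which matches the idea that a single number for x means "sample from 1 up to that number".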
Let’s say I wanted to create a random collection of 5 numbers, all in the range of 100 to 102:
random_vec = sample(x = 100:102, size = 5, replace = TRUE)
random_vec
## [1] 100 101 102 102 101
I’ll let you experiment using variables for the minimum and maximum values of the range you want to sample from.
- Try using variables called min and max with the values 100 and 102 to produce the same results as the snippet above.
Here’s a skeleton:
min = 100
max = 102
random_vec = sample(x = mi.......
In the function-building exercise above, I ask you to create a function that:
- Creates a vector of n random integers between the values of min and max.
If I were working on this problem, I would first build a function skeleton like this:
create_and_print_vec = function(n, min = 1, max = 10)
{
  # Step 1: Create a vector of n random numbers between min and max
  my_random_vec = [your code goes here!]
  # Step 2: Loop through all the values of my vector
  for (i in .......)
  {
    [your loop code goes here]
  }
}