Learning objectives, technical skills and concepts:

  • High and low level programming
  • data types: type casting and coercion
  • advanced subsetting
  • loops
  • custom functions

Introduction to Lab 2

The reading for this lab is long and winding, but it covers some important concepts you’ll need to know to be a wise R coder.

Data Types in R: preview

Quick review of logical tests:

You’ll need to become very familiar with logical tests and operators in R - a field also known as Boolean algebra.

The symbols for the most common logical tests we use in R are:

  • Test for equality: ==
  • Test for strict inequality: > and <
  • Test for equal or greater/less than: >= and <=
  • Test for non-equality: !=

Some important logical operators are:

  • The NOT operator: !
    • You can think of this as flipping the polarity of a TRUE to a FALSE and vice versa.
  • The AND operator: &
    • This one returns a value of TRUE only if both elements are TRUE.
  • The OR operator: |
    • This one returns a value of TRUE if at least one of the elements is TRUE.
    • It will only return a FALSE when both of the test elements are FALSE.

I encourage you to play with these operators and tests to get an intuitive feel for what they do.
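
As a starting point for that experimentation, here is a minimal sketch that applies each operator to two short logical vectors. The vector names x and y are made up for illustration and are not used elsewhere in the lab.

```r
# Two made-up logical vectors (hypothetical names for this example only)
x = c(TRUE, TRUE, FALSE, FALSE)
y = c(TRUE, FALSE, TRUE, FALSE)

not_x = !x       # NOT: flips each element of x
x_and_y = x & y  # AND: TRUE only where both x and y are TRUE
x_or_y = x | y   # OR: FALSE only where both x and y are FALSE

print(not_x)     # FALSE FALSE TRUE TRUE
print(x_and_y)   # TRUE FALSE FALSE FALSE
print(x_or_y)    # TRUE TRUE TRUE FALSE
```

Try changing individual entries of x and y and predicting the result before you run the code.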

Optional info about additional operators

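There are two related operators: && and ||. These are not ‘vectorized’: they are meant for comparing single TRUE/FALSE values. In versions of R before 4.3.0 they evaluate only the first element of the objects they are comparing; in R 4.3.0 and later, giving them a vector with more than one element is an error.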

To avoid confusion and unexpected results we will avoid using && and || in this course.

To illustrate the difference, we can compare elements in some vectors.

a = c(T, F, F)
b = c(T, T, F)
c = c(T, T, F, T)
a & b
## [1]  TRUE FALSE FALSE

This produces a warning (not an error) since a and c are of different lengths: R recycles the shorter vector to match the longer one.

a & c
## [1]  TRUE FALSE FALSE  TRUE
a && b
## [1] TRUE

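In versions of R before 4.3.0 this is valid R code since && only compares the first elements. In R 4.3.0 and later, both a && b and a && c produce an error because the vectors have more than one element.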

a && c
## [1] TRUE

Some odd questions:

  • What kind of a number is 4?
  • What about 4.0?
  • Is 1.0 equivalent to TRUE?
  • Is 3.0000000000000001 the same as 3.0?

Let’s ask R:

class(4)
## [1] "numeric"
class(4.0)
## [1] "numeric"
1.0 == TRUE
## [1] TRUE
3.0000000000000001 == 3.0
## [1] TRUE

Were those the answers you expected?

More oddities

Let’s see if these make any more sense:

  • Does arithmetic addition have a truth value?
(0 + 1) == TRUE
## [1] TRUE
  • What about subtraction?
(0 - 1) == TRUE
## [1] FALSE

Can I multiply or divide by TRUE or FALSE?

3.0 * TRUE
## [1] 3
4 / FALSE
## [1] Inf
FALSE / FALSE
## [1] NaN
3.0 * (TRUE + FALSE)
## [1] 3
3.0 * (TRUE - FALSE)
## [1] 3
3.0 * (FALSE - FALSE)
## [1] 0

My head hurts a little bit after writing all of those!

How might we make sense of those results?

Why am I even asking you to know about such nonsensical things?

Programming language hierarchy, abstraction, and data types

All of the above weirdness is related to how R implements the concepts of data typing, coercion, and type casting.

Which brings us back to the idea of high- and low-level programming languages.

The classification of a programming language as low- or high-level is related to the level of abstraction between what the programmer writes and what the computer executes.

In other words, if a computer language does a lot of translating into information that a computer can understand, it is a high-level language.

A task in high-level and moderately-high level languages

In a high-level language (such as R), you might be able to calculate the sum of the elements of a matrix with a single command:

my_matrix = matrix(data = 1:9, nrow = 3, ncol = 3)
sum(my_matrix)
## [1] 45

Behind the scenes, R knows that it has to look at the elements of my_matrix and keep a running total as it adds all of the values together. In this case, there is a high level of abstraction between what you type, sum(my_matrix), and what the computer actually does.

In contrast, if you wanted to do the same task in Java, which is a “moderately high-level” language, you might have to write something like this (please don’t worry about trying to understand all of the code):


public class MatrixSumDemo
{
    public static int matrix_sum(int[][] input_matrix)
    {
        int running_total = 0;
        // Outer loop walks the rows; inner loop walks the columns of row i
        for (int i = 0; i < input_matrix.length; i++)
        {
            for (int j = 0; j < input_matrix[i].length; j++)
            {
                running_total += input_matrix[i][j];
            }
        }
        return running_total;
    }
    
    public static void main(String[] args) 
    {
        int[][] my_matrix = new int[3][3];
        
        my_matrix[0] = new int[] {1, 2, 3};
        my_matrix[1] = new int[] {4, 5, 6};
        my_matrix[2] = new int[] {7, 8, 9};
    
        System.out.println(matrix_sum(my_matrix));
    }       
}
## 45

You might notice that I didn’t call Java a low-level language! This means that even though you have to spell out more of the steps in Java (compared to R), there are still many layers of abstraction between the Java code and the instructions your computer processor can understand. Imagine if you had to write a program in binary that you could submit directly to your computer’s processor to evaluate…

We like to use high-level languages, but…

There are some serious trade-offs we should be aware of when we use high-level languages.

In the R/Java example you might have noticed that the text int occurred a lot in the Java code. That’s because Java requires us to specify exactly what kind of number we want a variable to hold (int specifies that a number is an integer). R tries to guess what kind of number we are using so we didn’t have to tell it that we wanted our matrix to be filled with integer values.
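
To see R’s guessing at work, you can ask for the class of a few values. This is a minimal sketch; the L suffix (as in 4L) is R’s way of writing an integer literal and is not used elsewhere in this lab.

```r
class(4)     # "numeric": a bare number is stored as a double
class(4L)    # "integer": the L suffix requests an integer
class(1:9)   # "integer": the colon operator builds an integer sequence
class(TRUE)  # "logical"
```

In each case R picked the storage type for us without being told.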

Usually that’s ok, but what happens when we try to multiply TRUE by 3?

Java wouldn’t even let you create a program in which such an operation were possible! That’s a safeguard against having unpredictable or undefined behavior.

In R, however, TRUE * 3 is perfectly legal, even if it’s not clear what it means or why we would want to do such a thing.

If a friend asked you what you get when you divide TRUE by 5, how would you respond?

A logic to the weirdness

Believe it or not, statements like TRUE * 3 have extremely useful applications.

We’ll come across examples later in the course. You’ll get a chance to explore data type coercion and casting in this lab.
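
As a small preview, here is a sketch of one common use of that coercion. Because TRUE is treated as 1 and FALSE as 0 in arithmetic, sum() of a logical vector counts the TRUE entries and mean() gives their proportion. The vector heights and the cutoff of 170 are made-up values for illustration.

```r
# Made-up data for illustration
heights = c(150, 172, 181, 165, 190)

# Explicit type casting: logical to numeric
as.numeric(TRUE)    # 1
as.numeric(FALSE)   # 0

# Implicit coercion: sum() and mean() convert TRUE/FALSE to 1/0
is_tall = heights > 170
n_tall = sum(is_tall)      # 3 values exceed 170
prop_tall = mean(is_tall)  # 3 out of 5, so 0.6
```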

Nesting functions

We’ve mostly encountered R calls that use a single function.

For more complicated or sophisticated tasks, we often have to combine numerous functions.

Suppose I wanted to print the value of a randomly-generated integer.

I could:

  1. Create a variable to store the randomly-generated number.
  2. Create text of a sentence that stated the value of the number.
  3. Print the sentence.
int_rnd = sample(100, 1)
int_rnd_sentence = paste0("The value of the randomly-generated number is: ", int_rnd)
print(int_rnd_sentence)
## [1] "The value of the randomly-generated number is: 75"

Or I could nest all of those tasks within a single function without creating the intermediate variables:

print(
  paste0(
    "The value of the randomly-generated number is: ", 
    sample(100, 1)))
## [1] "The value of the randomly-generated number is: 46"

Note that sample() was called within paste0() which was called within print().

Intro to loops

What is a loop?

  • It’s just a programming structure that allows you to repeat a bit of code many times. You might want to repeat the code a specified number of times, or you might want to repeat the code until a certain logical condition is met.

Here is a simple for-loop in R:

for (i in 1:10)
{
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Key items to note:

  • The for (... syntax lets R know that it will execute a for-loop.
  • Within the for syntax, the i in 1:10 tells R to execute the loop 10 times, using an index variable called i.
  • Note the sequence notation: 1:10. What does this expression do on its own?
  • Also note the special keyword in.
  • The code to execute is all contained in a set of curly braces {}
  • You can use the index variable within the loop. In this case I used print() to print the value during each pass through the loop.
  • The index variable changes its value with each pass through the loop (as shown by the output of print(i)).

We’ll look at other kinds of loops later. For now I encourage you to play with this loop skeleton to make sure you understand the syntax.
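
For instance, the index variable can be used in a calculation rather than just printed. The sketch below writes out, as a loop, the kind of running total that sum() keeps behind the scenes. The names my_values and running_total are made up for this example, and sum() remains the better tool in practice.

```r
my_values = c(2, 4, 6, 8)  # made-up data
running_total = 0

for (i in 1:4)
{
  # Add the i-th element to the total on each pass through the loop
  running_total = running_total + my_values[i]
}

print(running_total)
## [1] 20
```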

Using for-loops in R is controversial!

R also has a family of functions, the apply functions, that can accomplish the same tasks as a loop.

Some folks prefer to only use the apply approach, while others prefer to only use the loop approach.

Loops in R tend to be slow compared to loops in many other languages.

There is also a school of thought that says the apply approach is more elegant or aesthetically appealing.

My opinion is that whether or not you choose to use or avoid loops in R, you need to know what loops are and how they work. Loops are a fundamental concept in computing, within and beyond the R world.
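
To make the comparison concrete, here is a sketch of the same task, squaring the numbers 1 through 5, written once with a for-loop and once with sapply(), one member of the apply family. The variable names are made up for this example.

```r
# Loop approach: fill a pre-allocated vector one element at a time
squares_loop = numeric(5)
for (i in 1:5)
{
  squares_loop[i] = i^2
}

# Apply approach: sapply() calls the function once for each element of 1:5
squares_apply = sapply(1:5, function(i) i^2)

squares_loop   # 1 4 9 16 25
squares_apply  # 1 4 9 16 25
```

Both versions produce the same vector; which one reads better is largely a matter of taste.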

Intro to custom functions

You used a custom function in the in-class activity on Tuesday.

Here’s a very simple custom function:

print_number = function(n)
{
  print(paste0("The value of the number is ", n))
}
print_number(145)
## [1] "The value of the number is 145"

Things to notice:

  • Use the function function() to define a new function.
    • The multiple meanings of the word function here are unfortunate.
  • The arguments to function() become the arguments to the new function you want to create. In this example there is only one argument: n.
  • Just like a loop, the code that executes is written within curly braces {}.
  • When the function executes, the variable n within the body of the function takes on the value that you supplied:
    • The code print_number(145) causes the variable n within the function to take the value 145.

Some terminology

  • Argument: an input to a function. A function can have zero or more arguments.
    • In R, arguments can have default values.
    • Arguments have names.
    • R expects the arguments to be supplied in the order specified by the function definition. Unless…
    • You can input the arguments in any order if you specify their names.
  • Function body: the code that runs inside the function.
    • The tasks that the function performs are all written within the function body.
    • The function body is written within curly braces {}.
    • Function bodies can contain many lines of code.
  • Return value: the value that a function produces.
    • Functions do not have to have an explicit return value.
    • The print_number() function above does not use return(). In R, such a function hands back the value of the last expression it evaluates, so print_number() quietly returns the string that it printed.
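
The terms above can be seen together in one small function. This is only a sketch: scale_number() and its arguments x and multiplier are hypothetical names invented for this example.

```r
# x has no default value; multiplier has a default value of 2
scale_number = function(x, multiplier = 2)
{
  # Function body: return() hands the result back to the caller
  return(x * multiplier)
}

scale_number(5)                      # 10: multiplier uses its default
scale_number(5, 3)                   # 15: arguments matched by position
scale_number(multiplier = 3, x = 5)  # 15: named arguments in any order
```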

Arguments

All of the following calls to rnorm() are equivalent: each one draws 10 values from a normal distribution with mean 0 and standard deviation 1.

rnorm(10)
rnorm(n = 10, sd = 1)
rnorm(sd = 1, mean = 0, n = 10)

If you consult the R help entry for rnorm() you will see that the 3 arguments are (in order):

  • n
  • mean
  • sd

Both mean and sd have default values (0 and 1, respectively).
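
Because rnorm() draws random numbers, the three calls will not print the same values from one run to the next. One way to check that they really are equivalent is to reset the random number generator with set.seed() before each call. A minimal sketch, in which the seed value 42 and the names draw_1, draw_2, and draw_3 are arbitrary choices:

```r
set.seed(42)
draw_1 = rnorm(10)

set.seed(42)
draw_2 = rnorm(n = 10, sd = 1)

set.seed(42)
draw_3 = rnorm(sd = 1, mean = 0, n = 10)

# Starting from the same seed, all three calls produce the same numbers
identical(draw_1, draw_2)
## [1] TRUE
identical(draw_1, draw_3)
## [1] TRUE
```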

Check out the R-help entry for rnorm() by typing ?rnorm into the console window:

?rnorm

Lab Questions

Logical Subsetting I: Questions 1 - 2

You’ve used logical subsetting to select elements of matrices and vectors. With small data sets it’s possible to look at all of the elements at once and visually detect the indices of the elements you want. This is not possible with larger data sets.
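
As a reminder of how logical subsetting works, here is a sketch on a vector small enough to check by eye. The vector small_vec and the cutoff of 10 are made up for this example.

```r
small_vec = c(4, 11, 7, 12, 2)  # made-up data

# The logical test returns one TRUE or FALSE per element
is_big = small_vec > 10

# Using the logical vector as an index keeps only the TRUE positions
small_vec[is_big]   # 11 12
```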

Run the following code to create a large vector containing randomly generated integers between 1 and 12:

n = 12345
vec_1 = sample(12, n, replace = TRUE)
head(vec_1)

Use a logical test operator to create a Boolean vector (called vec_2) whose entries are TRUE if the corresponding entry in vec_1 is 3 and FALSE otherwise.

Self test: you can use vec_2 to retrieve all of the 3 elements of vec_1 using the following:

vec_1[vec_2]

You should see a vector whose elements are all 3.

  • Q1 (2 pts.): Show the R code you used to create vec_2.

Your code should be a complete and self-contained example. I should be able to paste your code into a fresh R session on my computer and re-create your vec_2.

  • Q2 (2 pts.): Give two reasons why determining which elements in vec_1 have value 3 by visual inspection is a bad idea.

Logical Subsetting II: Questions 3 - 5

Run the following code to create a large vector containing randomly generated integers between 1 and 12:


n = 12345
vec_1 = sample(12, n, replace = TRUE)
head(vec_1)



Use the function length() to determine how many elements are in vec_1.


Now, run the following line to check how many entries have the value 3:

sum(vec_1 == 3)



Finally, run the following code several times taking note of how many 3 entries appear each time you run it.

n = 10
vec_1 = sample(12, n, replace = TRUE)
paste0("Sum of elements with value 3: ", sum(vec_1 == 3))
  • Q3 (1 pt.): Why didn’t you always get the same count of 3 entries each time?
  • Q4 (3 pts.): Considering the different vectors generated each time, explain why using a logical test is a safe way to select entries with a value of 3.
  • Q5 (5 pts.): Explain why performing ‘by hand’ subsetting is very, very bad practice. You may want to consider re-usability of code, working with different sized data sets, and sharing code with collaborators.
    • Your answer should cite at least two reasons why ‘by hand’ subsetting is bad.

Basic Loops: Question 6

You may want to review the for-loop example in the lab walkthrough.

for (i in 1:10)
{
  print(i)
}


Modify the code in the body of the loop to print out a message like “This is loop iteration: 1” for each run through the loop.


  • Hint: use the print() and [paste() or paste0()] functions.
  • Hint: review the nesting functions example in the lab walkthrough.



  • Q6 (3 pts.): Provide the code for your modified loop. It must run as a self-contained example on a fresh R session on my computer.

Intermediate Loops: Question 7

You may want to review the for-loop example in the lab walkthrough.

Run the following code on your computer:

for (i in 1:10)
{
  print(i)
}

Note that the loop runs through exactly 10 iterations…

What if you wanted the loop to execute an arbitrary number of times?


  • Create a variable, n, that contains an integer value.
  • Modify the code for the loop so that it runs n times.
  • Q7 (2 pts.): Provide the code for the modified loop that executes n times. It needs to be a self-contained example. I should be able to set the value of n and then run your loop on my computer.

Intermediate Loops 2: Question 8

  • Create an integer variable, n, that holds the value 17.
  • Write code to create a vector called vec_1 of length n. vec_1 should contain [pseudo]randomly generated integers between 1 and 10.
  • Hint: Check out the help entry and consult Dr. Google on how to use the sample() R function.
  • Hint: Take a look at the code I provided in an earlier question for an example of how to create a vector of random values.

Now, create a loop that:

  • Iterates n times (once for each element of vec_1).
  • Prints a message that includes the iteration number as well as the corresponding element of vec_1


Your output should look something like this:

## The element of vec_1 at index 1 is 4.
## The element of vec_1 at index 2 is 10.
## The element of vec_1 at index 3 is 3.
## The element of vec_1 at index 4 is 2.
## The element of vec_1 at index 5 is 2.
## The element of vec_1 at index 6 is 9.

Hint: Think of what code you’ll need to include within the body of the loop to display:

  1. The index number.
  2. The value of vec_1 at the index.

  • Q8 (4 pts.): Provide the code you used to create n, vec_1, and the loop. As always, it should run as a stand-alone example in a fresh R session on my computer.

Functions: Question 9

Objective

Write a function create_and_print_vec().

  • You’ve created a loop that can print a message with the value of each element of a vector of arbitrary length.
  • Now you’ll wrap it all into a custom function.
  • Review the lab materials about writing custom functions in the lab walkthrough.

Function Arguments

Your function should take three integer arguments, n, min, and max.

  • n has no default value.
  • min has a default value of 1.
  • max has a default value of 10.

Function Output

Your function needs to do the following:

  • Create a vector of n random integers between the values of min and max.
  • Loop through the elements of the vector and print a message with the index of the element and its value.

Here’s a skeleton:

create_and_print_vec = function(n, min = , max =)
{
  # Function body goes here
}
  • Hint: You’ve already written code that accomplishes almost all of these tasks.
  • Hint: You’ve used a single number as the first argument to sample(). Look up the R help entry to see what other kinds of values the x argument of sample() can accept.

Your function should create output like this with default values for min and max:

## [1] "The element at index 1 is 3"
## [1] "The element at index 2 is 3"
## [1] "The element at index 3 is 3"
## [1] "The element at index 4 is 3"
## [1] "The element at index 5 is 2"
## [1] "The element at index 6 is 2"
## [1] "The element at index 7 is 2"
## [1] "The element at index 8 is 3"
## [1] "The element at index 9 is 2"
## [1] "The element at index 10 is 2"
## [1] "The element at index 11 is 3"
## [1] "The element at index 12 is 3"
## [1] "The element at index 13 is 3"
## [1] "The element at index 14 is 2"
## [1] "The element at index 15 is 2"
## [1] "The element at index 16 is 3"
## [1] "The element at index 17 is 2"
## [1] "The element at index 18 is 2"
## [1] "The element at index 19 is 3"
## [1] "The element at index 20 is 2"

When you use min = 100 and max = 2000 your output should resemble:

create_and_print_vec(10, min = 100, max = 2000)
## [1] "The element at index 1 is 1600"
## [1] "The element at index 2 is 289"
## [1] "The element at index 3 is 1818"
## [1] "The element at index 4 is 1999"
## [1] "The element at index 5 is 1963"
## [1] "The element at index 6 is 1643"
## [1] "The element at index 7 is 382"
## [1] "The element at index 8 is 1802"
## [1] "The element at index 9 is 1950"
## [1] "The element at index 10 is 1484"
  • Q9 (10 pts.): Provide the code you used to build your function.
    • To receive full credit your code must run without error on a new R session and produce output similar to the examples given in the instructions.

Report

Compile your answers to all 9 questions into a PDF document and submit via Moodle.

  • You may also do your work in an R Notebook and submit a rendered HTML file.

Supplement: Hints

Here is a collection of hints I’ve compiled based on student questions over the years. Perhaps your questions could contribute to new hints!

You may not need these to complete the lab, so don’t peek if you don’t want to!

However, if you are stuck you may find something useful here.

Using the sample() function

There is a lot of info in the help entry for sample(), but it can be difficult to understand R help entries, especially when you’re first starting with R!

Hint 1: Draw a random integer between 1 and n.

When sample() has x equal to a single value, it draws random numbers from 1 up to that number.

Let’s try a simple example with 1’s and 2’s:

sample(x = 2, size = 1)
## [1] 1

Try running the code several times and look at the output.

Now, let’s use a variable:

n = 5
sample(x = n, size = 1)
## [1] 5

Hint 2: Draw a random integer within a range of values.

Let’s say I wanted to create a random collection of 5 numbers, all in the range of 100 to 102:

random_vec = sample(x = 100:102, size = 5, replace = TRUE)
random_vec
## [1] 100 101 102 102 101
  • Try running the code several times to get a feel for what’s happening.

I’ll let you experiment using variables for the minimum and maximum values of the range you want to sample from.

  • Try creating variables min and max with the values 100 and 102 to produce the same results as the snippet above.

Here’s a skeleton:

min = 100
max = 102

random_vec = sample(x = mi.......

Strategies for building functions

Build a function skeleton

In the function-building exercise above, I ask you to create a function that:

  1. Creates a vector of n random integers between the values of min and max.
  2. Loops through the elements of the vector and prints a message with the index of the element and its value.

If I were working on this problem, I would first build a function skeleton like this:

create_and_print_vec = function(n, min = 1, max = 10)
{
  
  # Step 1: Create a vector of n random numbers between min and max
  
  my_random_vec = [your code goes here!]
  
  # Step 2: Loop through all the values of my vector
  
  for(i in .......)
  {
    [your loop code goes here]
    
  }
}