Learning objectives, technical skills and concepts:

  • Random number generation
  • Defining and quantifying error
  • Intro to optimization

Probability distributions in R

R has built-in functions to calculate probability density (or mass), cumulative probability, and to generate random numbers sampled from lots of distributions.

We’ll focus on:

  • Probability functions
  • Cumulative probability functions
  • Generating random deviates

Each common distribution has a set of 4 R functions. The naming scheme is as follows (using the normal distribution as an example):

  • dnorm(): the probability density

  • pnorm(): the cumulative probability density

  • qnorm(): the quantile function

  • rnorm(): function to generate random, normally-distributed numbers.

  • Probability distributions may have one or more parameters. You’ll need to check out the help entry for the functions for the particular distribution you want to use to see how to specify them.

Practice Exercise: name the distribution function

In the plots below of standard normal curves below (mean = 0, sd = 1), name the functions I used to calculate the:

  • Probability density, i.e. the height of the curve at x
  • Cumulative density, i.e. the area under the curve left of x

Hint: I used the following values of x: -1.96, -1, 0, 1.96.

  • You can test your answers with these values to check that you have the right functions.

Plotting a PDF curve

Study the following example of how to create a basic plot of a normal curve:

# Generate a vector of x-values
x = seq(-3, 3, length.out = 1000)
y = dnorm(x)

plot(x, y, main = "Normal PDF", type = "l")
abline(h = 0)

Things to consider:

  • How did I create the y-values (the density)?
  • What are the mean and standard deviation of the distribution that I used?
  • What arguments could I use to change those values?
  • How did I plot a horizontal line at y = 0?

Sampling from a distribution

It’s easy to generate [pseudo] random numbers in R!

You’ve already used sample().

All of the common parametric distributions have corresponding random number generating functions in R.

Random deviates functions

The random deviates functions start with the letter r, for example:

  • rnorm()
  • rbinom()
  • rpois()

will generate random numbers following a Normal, binomial, and poisson distribution, respectively.

Penguins Example

It’s often useful to generate random numbers to compare against actual data. For example, I may want to compare histograms of observed penguin body masses against randomly generated body masses to see if they are consistent with each other.

To plot a histogram of body masses:

require(palmerpenguins)
hist(
  penguins$body_mass_g,
  main = "Histogram of Penguin Body Mass",
  xlab = "Body Mass (g)")

  • Note the histogram looks a little bit skewed.

I could ask: If the data came from a normal distribution, what might the histogram look like?

To generate random numbers from a Normal distribution, I need to know three things:

  1. The number of observations to create
  2. The mean
  3. The standard deviation.

We can get all of those from the penguin data itself:

mean(penguins$body_mass_g, na.rm = TRUE)
sd(penguins$body_mass_g, na.rm = TRUE)
nrow(penguins)
## [1] 4201.754
## [1] 801.9545
## [1] 344

There are 344 observations with a mean of 4202 grams and standard deviation of 802 grams.

What did the na.rm = TRUE argument do in my calls to mean() and sd()?

  • Check the R documentation and try running the code with na.rm = FALSE.

Random Penguin Masses

I’ll generate four vectors of random penguin body masses using the quantities I calculated from the observed values:

n_pts        = 344
penguin_mean = 4202
penguin_sd   = 802

dat_1 = rnorm(n = 344, mean = 4202, sd = 802)
dat_2 = rnorm(n = 344, mean = 4202, sd = 802)
dat_3 = rnorm(n = 344, mean = 4202, sd = 802)
dat_4 = rnorm(n = 344, mean = 4202, sd = 802)

Hint: a better way to create simulated masses

Did you notice that I typed the same numbers four times to generate the 4 data sets?

This is generally bad practice! What if I wanted to simulate data with a different standard deviation? I would have to remember to change the value in four different locations; this is subject to human error.

Here’s one way I could use variables to soft code my population params:

n_samples = 344
pop_sd = 802
pop_mean = 4202

dat_1 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_2 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_3 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_4 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)

Simulated Body Mass Histograms

Now, I can plot histograms of my simulated data:

par(mfrow = c(2, 2))

hist(dat_1)
hist(dat_2)
hist(dat_3)
hist(dat_4)

There is variability because the numbers were randomly generated, do the histograms look consistent with the observed data?

Compare these to the histogram of the real data (click to show/hide)

Random Uniform Numbers

The runif() function will generate random, uniformly-distributed numbers:

dat_unif = runif(n = 27, min = 0, max = 4)
hist(dat_unif)

  • A uniform distribution should look flat, but that histogram doesn’t look flat at all!

How could I modify my sampling procedure to get a more representative sample?

Hint: It’s all about sample size (click to show/hide)

Small samples are subject to sampling error, i.e. non-representative samples. If I randomly generate more numbers, my histograms will look more uniform:

dat_unif = runif(n = 270, min = 0, max = 4)
hist(dat_unif)

That looks more uniform, but note that even with 270 samples there’s still a lot of variability!

You’ll get a chance to experiment with random numbers and sample sizes in the lab questions.

Randomness and Replication: set.seed()

Note that our computers can’t generate truly random numbers. Rather, R uses sophisticated random number generators (RNGs) to create sets of pseudorandom numbers which have the same statistical properties of sets of numbers that are truly random.

The values that the generators produce depend on an internal state, which we can initialize using the set.seed() function. If we don’t explicitly set the random seed, the RNG generates its own seed when R starts up, using the current time.

It’s often useful to have R generate the same sequence of random numbers multiple times. Here’s what happens when I set the random seed before generating random, uniformly distributed data:

set.seed(1)
dat_unif_1 = runif(n = 270, min = 0, max = 4)
set.seed(1)
dat_unif_2 = runif(n = 270, min = 0, max = 4)

par(mfrow = c(1, 2))
hist(dat_unif_1)
hist(dat_unif_2)

  • The two data sets contain the same numbers!

A Random Number Sequence Analogy


It can be difficult to understand what set.seed() is doing.

Here’s an analogy that might help (click to show/hide)

Note: I did not write this analogy, and I don’t know the original author. If anybody knows the source, please let me know!

Suppose you produce a radio show and you want want to have a sample of the kinds of sounds you might hear on a typical city street to use as background sounds for brief (4 second) transitions between segments. You set up a microphone and record the sounds for four hours. During that time you might record birds, construction noises, bits of conversations, car sounds, etc.


You could think of your recording as a realization of a stochastic process.

  • You could make another recording, and the sequence of sounds you record would be different.


Now, suppose you want each transition segment to be different, so you create a sequence of starting times in your four-hour recording. Starting times are separated by 5 minutes. You end up with a sequence of 48 starting times in your recording.


Since it’s a busy street, the sounds you hear now, and the sounds you might hear 5 minutes from now aren’t related at all. In a sense, they are independent.


You set up your transitions so that after you play segment 48, it loops back to segment 1 and advances until you reach segment 48 again. Now you’ve got a loop of random street sounds!

You need about 4 - 5 transition segments per episode of your show. When you are ready to record an episode you simply choose a position in the cycle of street sounds and voila, you’ve got your random city noise transitions set up for the show!

You could start each show at a different position so you’d have a different set of random street noise transitions. Since you only use a very small proportion of the sounds in your cycle, you’re unlikely to use the same sequence in two episodes.

On the other hand, if you found a particular sequence of transitions you liked, you could just start each episode at the same starting position.

R’s random number generators work similarly, except that instead of a repeating sequence of 48 random numbers, the number is very very very large. In fact, if you generated a billion random numbers, you would not repeat the sequence!

When you use set.seed(), you are specifying a starting position in the sequence of random numbers that R will generate. If you don’t set an explicit seed, you’re very unlikely to get the same sequence twice!

Measuring error

Let’s calculate some residuals!

What’s a residual?

  • A residual is the difference between a predicted value and the observed value.

Here’s some points and a line of predicted values from a deterministic model I created.

  • I guessed the slope and intercept parameters for the linear function.

We’ll need the function I built to find a y-value given a point and slope:

# Calculates the value of y for a linear function, given the coordinates
# of a known point (x1, y1) and the slope of the line.
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept = 
    function(x1, y1, slope) 
      return(-(x1 * slope) + y1)
  
  linear = 
    function(x, yint, slope) 
      return(yint + x * slope)
  
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

We’ll use it below.

Don’t worry about understanding the contents of the code, simply copy and past it into your R script or R Markdown document and run it so that it is available in R’s environment.

set.seed(123)
n = 17
slope = 0.7
intcp = 0.2

guess_x = 6
guess_y = 4
guess_slope = 0.72

x = runif(n = n, min = 1, max = 10)
y = rnorm(n = n, mean = slope * x + intcp)

plot(x, y, pch = 16)
curve(line_point_slope(x, guess_x, guess_y, guess_slope), add = T)

The residuals are the difference between the predicted and observed values:

Create the data

If we visually estimate a linear function to create a deterministic model to fit a set of points, we can use it to calculate predicted values.

We can use some simulated data as an example. Simulated data can be very helpful for learning because it is well-behaved, unlike real messy data.

  • Create a data.frame with some random values:
n_pts = 10
x_min = 1
x_max = 10

# X values are uniformly distributed
x_random = runif(n = n_pts, min = x_min, max = x_max)

# Y values are normally-distributed.
# I used the default parameters for mean and sd.
y_random = rnorm(n = n_pts)

dat_random = data.frame(x = x_random, y = y_random)

plot(y ~ x, data = dat_random, pch = 8)

Each time run the code, you’ll get different numbers. Why?

  • Try running it several times to see what happens. You can try different numbers of points and different point shapes.

We can force R to generate the same sequence using the set.seed() function before we use the random-number generator.

Try using set.seed(123) before running the code above. Run it twice and see what happens.

Try using a different seed number.

Fit a linear deterministic model

Use the line_point_slope() to estimate a linear model for the simulated points.

  • Hint: you will probably want to use set.seed() so that you can build the same “random” points while you’re working on this.

Be sure to save the values of the point coordinates and slope of your visually-estimated linear model into variables so you can re-use and change them easily later on.

Here’s my guess:

guess_x = 6
guess_y = 0
guess_slope = 0.1

plot(y ~ x, data = dat_random, pch = 8)
curve(line_point_slope(x, guess_x, guess_y, guess_slope), add = T)

Add predicted values

I can calculate the predicted y values based on my estimated model parameters:

line_point_slope(dat_random$x, guess_x, guess_y, guess_slope)
##  [1] -0.102019933  0.219032361 -0.390290666  0.004853185 -0.314121749 -0.385221515  0.177977078  0.305540823 -0.162983502  0.098603675

I’ll add these as a column called y_predicted to dat. I’ll let you figure out the code to add the column.

Calculate the residuals

Now I can finally calculate the residuals!

The residuals are the difference between the predicted and observed values…

I’ll add them as a column called resids. I’ll let you figure out the code.

What happens when you calculate the sum of the resids column?

I got the following:

sum(dat_random$resids)
## [1] -5.359858

Notice that the negative and positive residuals partially cancel out. How would I calculate the sum of the absolute values of the residuals?

  • Hint: check out the abs() function.
  • Challenge question: Should a good model fit have a low or high sum of residuals?

Plot the residuals

A fitted values vs. residuals scatterplot is worth a thousand words:

A histogram of residual values is also helpful:

Lab Questions

Note on pasted code!

If you create your report in Word, Google Docs, etc, make sure you include runnable code in your answers.

When you include snippets of code, you must copy your code from a text editor (such as the script window in RStudio). - If you copy your code from the R Console, you will copy the > characters at the beginning of the lines and it won’t run directly when pasted into the console (i.e. when we grade your code).

This is OK:

norm_17 = ....your code here...

This is incorrect:

> norm_17 = ....your code here...

Normal Vectors (Q1)

Instructions:

  1. Create four vectors of normally-distributed random numbers, norm_17, norm_30, norm_300, and norm_3000.
  • You should tell R that you want your random deviates have a mean of 10.4 and a standard deviation of 2.4.
  • norm_17 should have 17 elements, norm_30 should have 30 elements, norm_300 should have 300 elements, and norm_3000 should have 3000 elements.

Q1 Vector Code

  • (1 pt.) Show the code you used to create your vectors.
  • Hint: I need to be able to run your code on a fresh R session on my computer. You should test your answers in a fresh R session.

  • Hint: You should save the parameters you use for mean and standard deviation into variables so that they are easy to re-use (and easy to change later, if needed).

Normal Vectors: Histograms (Q 2-5)

Instructions:

  1. Create a figure including histograms of your vectors norm_17, norm_30, norm_300, and norm_3000.
  • The histograms should all be on the same figure, arranged in panels (2 rows, 2 columns).
  • Each histogram must have an informative title that indicates how many randomly generated data points were used to build the histogram.
  1. Save your figure to a file called lab_04_hist_01.png.
  • Your figure should be 1600 pixels high, and 1500 pixels wide.
  • It should have a resolution of 180 dpi.
    • Check out the description of the res argument int he help entry for png().
    • My walkthrough of saving figures to a file may be helpful to you.

Q2 Histogram Code

  • (2 pts.) Include the R code you used to create your figure. Your answer should include code that builds the figure as well as saves it to a file.

Q3 Histogram Figure

  • (4 pts.) Upload your lab_04_hist_01.png file to Moodle. Make sure you double check the image size and resolution requirements.

Q4 Histogram Shapes 1

  • (2 pts.) Qualitatively describe the differences among the histograms.

Q5 Histogram Shapes 2

  • Explain why the shapes of the histograms are different.

Q6 Standard Normal

Q6: (1 pt) What are the parameters and their values for the standard Normal distribution?

Normal Density Function (Q 7-8)

Instructions

  1. Recall the code I used to create a plot of the density function of a standard normal distribution:
# Generate a vector of x-values
x = seq(-6, 6, length.out = 1000)
y = dnorm(x)

plot(x, y, main = "Standard Normal PDF", type = "l", xlim = c(-3, 3))
abline(h = 0)

  • What happens when you change the xlim values?
  1. Use the code above as a template to plot a density curve for a normal distribution with:
    • mean = 10.4
    • standard deviation = 2.4
  2. Include an informative title that states the mean and standard deviation values you used.
  3. Save your figure to a Scalable Vector Graphics (svg) file called norm_1.svg.
    • Use the svg() function to create your image.
    • Make sure to check out the help entry for info on how to specify width and height in svg()
    • If you are using a Mac and you receive errors about Cairo or Cairo.DLL, you may have trouble using svg. In this case use pdf() instead. Just make sure you name you file norm_1.pdf instead of norm_1.svg. The width and height arguments do not have the same units in pdf() and svg(). svg() uses dpi while pdf() uses inches.
  • Check out the help entry for dnorm() for a description of the arguments.

Q7 Density Figure Code

  • (2 pts.) Include the R code you used to create your figure. Your answer should include code that builds the figure as well as saves it to a file.

Q8 Density Figure File

  • (2 pts.) Upload norm_1.svg (or norm_1.pdf).

For reference, here is how my figure turned out:

  • Note that the peak is in the center of the plot. You might need to adjust the x limits of the plot window on your figure.

Random Data (Q 9-10)

Instructions:

  1. Review the section of the lab instructions in which I created a data.frame of random data.

  2. Experiment with different sets of randomly-generated data:

  • Try different numbers of points.
  • Try different distributions in the x- and y- values.
    • I used rnorm and runif. R has pre-made random number generating functions for lots of other distributions.
  • Try using set.seed() with different seed values.
  1. Plot your data! Try making histograms, boxplots, and scatterplots with different plotting parameters:
  • Try different plotting characters.
  • Try different colors and alpha (transparency) values.
    • rgb() and adjustcolor() may be helpful.
    • “steelblue” is a favorite of mine.
  1. After you’ve experimented with different data sets and plots, choose four of your favorites to create a figure.
  • Arrange your four plots into a single figure: it should be a 2 by 2 grid of panels.
  1. Save your figure to a file with an appropriate filename.

Q9 Random Data Set

  • (3 pts.) Show the R code you used to create one of the random datasets in your figure.

Q10 Random Data Image File

  • (2 pts.) Upload an image file of your figure. It may be in png (raster graphics), svg (vector graphics), or pdf (vector graphics) format.
    • Check out the corresponding functions png(), pdf(), and svg() for info.

Random Data Model Fit (Q 11-12)

Instructions:

  1. Choose one of the datasets you created for the previous question.
  2. Adapting the code in the lab walkthrough, visually fit a linear deterministic function through the data.
    • Make sure you save your parameters to variables so you can use them in the next question.
  3. Create a plot of your simulated data and your fitted linear model.
Click here to show/hide the line_point_slope() function
# Calculates the value of y for a linear function, given the coordinates
# of a known point (x1, y1) and the slope of the line.
line_point_slope = function(x, x1, y1, slope)
{
  get_y_intercept = 
    function(x1, y1, slope) 
      return(-(x1 * slope) + y1)
  
  linear = 
    function(x, yint, slope) 
      return(yint + x * slope)
  
  return(linear(x, get_y_intercept(x1, y1, slope), slope))
}

Q11 Random Data Model Fit (Q 13-14)

  • (3 pts.) Show the R code you used to create one of the random datasets in your figure.

Q12 Random Data Model Fit Image File

  • (2 pts.) Upload an image file of your figure. It may be in png (raster graphics), svg (vector graphics), or pdf (vector graphics) format.
    • Check out the corresponding functions png(), pdf(), and svg() for info.

Random Data Model Residuals (Q13-14)

Instructions:

  1. Use the dataset you chose for the previous question.
  2. Adapting the code in the lab walkthrough, create a column of predicted y-values.
  3. Adapting the code in the lab walkthrough, create a column of residuals.

Q13 Random Data Model Residuals (Q 13-14)

  • (2 pts.) Paste the R code you used to create create the columns of predicted values and residuals.

Q14 Random Data Model Residual Plot

  • (3 pts.) In your report, include the two following figures
    • A histogram of the model’s residuals.
    • A scatterplot of your model’s predicted values (on the x-axis) and residuals (on the y-axis).

Make sure your figure has appropriate title, legend, and other annotation as needed.

Report

Compile your answers to all 14 questions into a pdf document and submit via Moodle.

  • You may also do your work in an R Notebook and submit a rendered html file.

You must upload the four required figure image files to get full credit:

  • Image for Q3: lab_04_hist_01.png
  • Image for Q8: norm_1.svg or norm_1.pdf
  • Image for Q10: [your choice of descriptive filename]
  • Image for Q12: [your choice of descriptive filename]

You can check out my saving images to file tutorial for guidance.