R has built-in functions to calculate probability density (or mass), cumulative probability, and to generate random numbers sampled from lots of distributions.
We’ll focus on:
Each common distribution has a set of 4 R functions. The naming scheme is as follows (using the normal distribution as an example):
dnorm()
: the probability density
pnorm()
: the cumulative probability density
qnorm()
: the quantile function
rnorm()
: function to generate random,
normally-distributed numbers.
Probability distributions may have one or more parameters. You’ll need to check out the help entry for the functions for the particular distribution you want to use to see how to specify them.
In the plots below of standard normal curves below (mean = 0, sd = 1), name the functions I used to calculate the:
Hint: I used the following values of x: -1.96, -1, 0, 1.96.
Study the following example of how to create a basic plot of a normal curve:
# Generate a vector of x-values
x = seq(-3, 3, length.out = 1000)
y = dnorm(x)
plot(x, y, main = "Normal PDF", type = "l")
abline(h = 0)
Things to consider:
It’s easy to generate [pseudo] random numbers in R!
You’ve already used sample()
.
All of the common parametric distributions have corresponding random number generating functions in R.
The random deviates functions start with the letter r, for example:
rnorm()
rbinom()
rpois()
will generate random numbers following a Normal, binomial, and poisson distribution, respectively.
It’s often useful to generate random numbers to compare against actual data. For example, I may want to compare histograms of observed penguin body masses against randomly generated body masses to see if they are consistent with each other.
To plot a histogram of body masses:
require(palmerpenguins)
hist(
penguins$body_mass_g,
main = "Histogram of Penguin Body Mass",
xlab = "Body Mass (g)")
I could ask: If the data came from a normal distribution, what might the histogram look like?
To generate random numbers from a Normal distribution, I need to know three things:
We can get all of those from the penguin data itself:
mean(penguins$body_mass_g, na.rm = TRUE)
sd(penguins$body_mass_g, na.rm = TRUE)
nrow(penguins)
## [1] 4201.754
## [1] 801.9545
## [1] 344
There are 344 observations with a mean of 4202 grams and standard deviation of 802 grams.
What did the na.rm = TRUE
argument do in my calls to
mean()
and sd()
?
na.rm = FALSE
.I’ll generate four vectors of random penguin body masses using the quantities I calculated from the observed values:
n_pts = 344
penguin_mean = 4202
penguin_sd = 802
dat_1 = rnorm(n = 344, mean = 4202, sd = 802)
dat_2 = rnorm(n = 344, mean = 4202, sd = 802)
dat_3 = rnorm(n = 344, mean = 4202, sd = 802)
dat_4 = rnorm(n = 344, mean = 4202, sd = 802)
Hint: a better way to create simulated masses
Did you notice that I typed the same numbers four times to generate the 4 data sets?
This is generally bad practice! What if I wanted to simulate data with a different standard deviation? I would have to remember to change the value in four different locations; this is subject to human error.
Here’s one way I could use variables to soft code my population params:
n_samples = 344
pop_sd = 802
pop_mean = 4202
dat_1 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_2 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_3 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
dat_4 = rnorm(n = n_samples, mean = pop_mean, sd = pop_sd)
Now, I can plot histograms of my simulated data:
par(mfrow = c(2, 2))
hist(dat_1)
hist(dat_2)
hist(dat_3)
hist(dat_4)
There is variability because the numbers were randomly generated, do the histograms look consistent with the observed data?
The runif()
function will generate random,
uniformly-distributed numbers:
dat_unif = runif(n = 27, min = 0, max = 4)
hist(dat_unif)
How could I modify my sampling procedure to get a more representative sample?
Small samples are subject to sampling error, i.e. non-representative samples. If I randomly generate more numbers, my histograms will look more uniform:
dat_unif = runif(n = 270, min = 0, max = 4)
hist(dat_unif)
That looks more uniform, but note that even with 270 samples there’s still a lot of variability!
You’ll get a chance to experiment with random numbers and sample sizes in the lab questions.
set.seed()
Note that our computers can’t generate truly random numbers. Rather, R uses sophisticated random number generators (RNGs) to create sets of pseudorandom numbers which have the same statistical properties of sets of numbers that are truly random.
The values that the generators produce depend on an internal state,
which we can initialize using the set.seed()
function. If
we don’t explicitly set the random seed, the RNG generates its
own seed when R starts up, using the current time.
It’s often useful to have R generate the same sequence of random numbers multiple times. Here’s what happens when I set the random seed before generating random, uniformly distributed data:
set.seed(1)
dat_unif_1 = runif(n = 270, min = 0, max = 4)
set.seed(1)
dat_unif_2 = runif(n = 270, min = 0, max = 4)
par(mfrow = c(1, 2))
hist(dat_unif_1)
hist(dat_unif_2)
It can be difficult to understand what set.seed()
is
doing.
Note: I did not write this analogy, and I don’t know the original author. If anybody knows the source, please let me know!
Suppose you produce a radio show and you want want to have a sample of the kinds of sounds you might hear on a typical city street to use as background sounds for brief (4 second) transitions between segments. You set up a microphone and record the sounds for four hours. During that time you might record birds, construction noises, bits of conversations, car sounds, etc.
You could think of your recording as a realization of a stochastic process.
Now, suppose you want each transition segment to be different, so you create a sequence of starting times in your four-hour recording. Starting times are separated by 5 minutes. You end up with a sequence of 48 starting times in your recording.
Since it’s a busy street, the sounds you hear now, and the sounds you might hear 5 minutes from now aren’t related at all. In a sense, they are independent.
You set up your transitions so that after you play segment 48, it loops back to segment 1 and advances until you reach segment 48 again. Now you’ve got a loop of random street sounds!
You need about 4 - 5 transition segments per episode of your show. When you are ready to record an episode you simply choose a position in the cycle of street sounds and voila, you’ve got your random city noise transitions set up for the show!
You could start each show at a different position so you’d have a different set of random street noise transitions. Since you only use a very small proportion of the sounds in your cycle, you’re unlikely to use the same sequence in two episodes.
On the other hand, if you found a particular sequence of transitions you liked, you could just start each episode at the same starting position.
R’s random number generators work similarly, except that instead of a repeating sequence of 48 random numbers, the number is very very very large. In fact, if you generated a billion random numbers, you would not repeat the sequence!
When you use set.seed()
, you are specifying a starting
position in the sequence of random numbers that R will generate. If you
don’t set an explicit seed, you’re very unlikely to get the same
sequence twice!
What’s a residual?
Here’s some points and a line of predicted values from a deterministic model I created.
We’ll need the function I built to find a y-value given a point and slope:
# Calculates the value of y for a linear function, given the coordinates
# of a known point (x1, y1) and the slope of the line.
line_point_slope = function(x, x1, y1, slope)
{
get_y_intercept =
function(x1, y1, slope)
return(-(x1 * slope) + y1)
linear =
function(x, yint, slope)
return(yint + x * slope)
return(linear(x, get_y_intercept(x1, y1, slope), slope))
}
We’ll use it below.
Don’t worry about understanding the contents of the code, simply copy
and past it into your R script or R Markdown document and run it so that
it is available in R’s environment.
set.seed(123)
n = 17
slope = 0.7
intcp = 0.2
guess_x = 6
guess_y = 4
guess_slope = 0.72
x = runif(n = n, min = 1, max = 10)
y = rnorm(n = n, mean = slope * x + intcp)
plot(x, y, pch = 16)
curve(line_point_slope(x, guess_x, guess_y, guess_slope), add = T)
The residuals are the difference between the predicted and observed values:
If we visually estimate a linear function to create a deterministic model to fit a set of points, we can use it to calculate predicted values.
We can use some simulated data as an example. Simulated data can be very helpful for learning because it is well-behaved, unlike real messy data.
data.frame
with some random values:n_pts = 10
x_min = 1
x_max = 10
# X values are uniformly distributed
x_random = runif(n = n_pts, min = x_min, max = x_max)
# Y values are normally-distributed.
# I used the default parameters for mean and sd.
y_random = rnorm(n = n_pts)
dat_random = data.frame(x = x_random, y = y_random)
plot(y ~ x, data = dat_random, pch = 8)
Each time run the code, you’ll get different numbers. Why?
We can force R to generate the same sequence using the
set.seed()
function before we use the random-number
generator.
Try using set.seed(123)
before running the code above.
Run it twice and see what happens.
Try using a different seed number.
Use the line_point_slope()
to estimate a linear model
for the simulated points.
set.seed()
so that
you can build the same “random” points while you’re working on
this.Be sure to save the values of the point coordinates and slope of your visually-estimated linear model into variables so you can re-use and change them easily later on.
Here’s my guess:
guess_x = 6
guess_y = 0
guess_slope = 0.1
plot(y ~ x, data = dat_random, pch = 8)
curve(line_point_slope(x, guess_x, guess_y, guess_slope), add = T)
I can calculate the predicted y values based on my estimated model parameters:
line_point_slope(dat_random$x, guess_x, guess_y, guess_slope)
## [1] -0.102019933 0.219032361 -0.390290666 0.004853185 -0.314121749 -0.385221515 0.177977078 0.305540823 -0.162983502 0.098603675
I’ll add these as a column called y_predicted
to
dat
. I’ll let you figure out the code to add the
column.
Now I can finally calculate the residuals!
The residuals are the difference between the predicted and observed values…
I’ll add them as a column called resids
. I’ll let you
figure out the code.
What happens when you calculate the sum of the resids
column?
I got the following:
sum(dat_random$resids)
## [1] -5.359858
Notice that the negative and positive residuals partially cancel out. How would I calculate the sum of the absolute values of the residuals?
abs()
function.A fitted values vs. residuals scatterplot is worth a thousand words:
A histogram of residual values is also helpful:
Note on pasted code!
If you create your report in Word, Google Docs, etc, make sure you include runnable code in your answers.
When you include snippets of code, you must copy your code from a
text editor (such as the script window in RStudio). - If you copy your
code from the R Console, you will copy the
>
characters at the beginning of the
lines and it won’t run directly when pasted into the console (i.e. when
we grade your code).
This is OK:
norm_17 = ....your code here...
This is incorrect:
> norm_17 = ....your code here...
norm_17
, norm_30
, norm_300
, and
norm_3000
.norm_17
should have 17 elements, norm_30
should have 30 elements, norm_300
should have 300 elements,
and norm_3000
should have 3000 elements.Hint: I need to be able to run your code on a fresh R session on my computer. You should test your answers in a fresh R session.
Hint: You should save the parameters you use for mean and standard deviation into variables so that they are easy to re-use (and easy to change later, if needed).
norm_17
, norm_30
, norm_300
, and
norm_3000
.lab_04_hist_01.png
.res
argument int he
help entry for png()
.lab_04_hist_01.png
file to Moodle.
Make sure you double check the image size and resolution
requirements.Q6: (1 pt) What are the parameters and their values for the standard Normal distribution?
# Generate a vector of x-values
x = seq(-6, 6, length.out = 1000)
y = dnorm(x)
plot(x, y, main = "Standard Normal PDF", type = "l", xlim = c(-3, 3))
abline(h = 0)
xlim
values?norm_1.svg
.
svg()
function to create your image.svg()
pdf()
instead. Just make sure you name you file
norm_1.pdf
instead of norm_1.svg
. The width
and height arguments do not have the same units in
pdf()
and svg()
. svg()
uses dpi
while pdf()
uses inches.dnorm()
for a description
of the arguments.norm_1.svg
(or
norm_1.pdf
).For reference, here is how my figure turned out:
Review the section of the lab instructions in which I created a
data.frame
of random data.
Experiment with different sets of randomly-generated data:
rnorm
and runif
. R has pre-made
random number generating functions for lots of other distributions.set.seed()
with different seed values.rgb()
and adjustcolor()
may be
helpful.png()
,
pdf()
, and svg()
for info.Instructions:
line_point_slope()
function
# Calculates the value of y for a linear function, given the coordinates
# of a known point (x1, y1) and the slope of the line.
line_point_slope = function(x, x1, y1, slope)
{
get_y_intercept =
function(x1, y1, slope)
return(-(x1 * slope) + y1)
linear =
function(x, yint, slope)
return(yint + x * slope)
return(linear(x, get_y_intercept(x1, y1, slope), slope))
}
png()
,
pdf()
, and svg()
for info.Instructions:
Make sure your figure has appropriate title, legend, and other annotation as needed.
Compile your answers to all 14 questions into a pdf document and submit via Moodle.
You must upload the four required figure image files to get full credit:
You can check out my saving images to file tutorial for guidance.